Message ID: 20230621235508.113949-1-suravee.suthikulpanit@amd.com
Headers
From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
To: linux-kernel@vger.kernel.org, iommu@lists.linux.dev, kvm@vger.kernel.org
Cc: joro@8bytes.org, robin.murphy@arm.com, yi.l.liu@intel.com, alex.williamson@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com, baolu.lu@linux.intel.com, eric.auger@redhat.com, pandoh@google.com, kumaranand@google.com, jon.grimm@amd.com, santosh.shukla@amd.com, vasant.hegde@amd.com, jay.chen@amd.com, joseph.chung@amd.com, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject: [RFC PATCH 00/21] iommu/amd: Introduce support for HW accelerated vIOMMU w/ nested page table
Date: Wed, 21 Jun 2023 18:54:47 -0500
Series
iommu/amd: Introduce support for HW accelerated vIOMMU w/ nested page table
Message
Suravee Suthikulpanit
June 21, 2023, 11:54 p.m. UTC
OVERVIEW
--------
The AMD IOMMU Hardware Accelerated Virtualized IOMMU (HW-vIOMMU) feature provides partial hardware acceleration for implementing guest IOMMUs. When the feature is enabled, the following components are virtualized:
  * Guest Command Buffer
  * Guest Event Log (work-in-progress)
  * Guest PPR Log (work-in-progress)

In addition, this feature can be used in combination with nested IOMMU page tables to accelerate address translation from GIOVA to GPA. In this case, the host page table (a.k.a. stage2 or v1) is managed by the hypervisor (i.e. KVM/VFIO) and the guest page table (a.k.a. stage1 or v2) is managed by the guest IOMMU driver (e.g. when booting the guest kernel with amd_iommu=pgtable_v2 mode).

Since the IOMMU hardware virtualizes the guest command buffer, this allows IOMMU operations to be accelerated, such as invalidation of guest pages (i.e. stage1), when the command is issued by the guest kernel without intervention from the hypervisor.

This series is implemented on top of the IOMMUFD framework. It leverages the existing APIs and ioctls for providing guest iommu information (i.e. struct iommu_hw_info_amd), and allowing the guest to provide guest page table information (i.e. struct iommu_hwpt_amd_v2) for setting up the user domain.

Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.

NOTES
-----
This series is organized into two parts:
  * Part 1: Preparing the IOMMU driver for HW-vIOMMU support (Patch 1-8).

  * Part 2: Introducing HW-vIOMMU support (Patch 9-21).

  * Patches 12 and 21 extend the existing IOMMUFD ioctls to support additional operations, which can be categorized into:
    - Ioctls to init/destroy an AMD HW-vIOMMU instance
    - Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
    - Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.
    - Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.
    - Ioctls to trap AMD HW-vIOMMU command buffer initialization.
    Since these are specific to the AMD HW-vIOMMU implementation but still want to leverage /dev/iommu, they are separated from the existing VFIO-related ioctls.

  * The initial revision only supports 1 PASID in the guest (i.e. PASID 0). Multiple-PASID support will be added in a subsequent revision.

GITHUB
------
  * A working Linux kernel prototype of this series [1] is based on [3].
  * This series is tested with QEMU [2] (work-in-progress).

REFERENCES
----------
[1] Linux GitHub branch for this series
    https://github.com/AMDESE/linux/tree/wip/iommufd_nesting-06192023-yi_amd_viommu_20230621
[2] QEMU GitHub branch to be used for testing this series
    https://github.com/AMDESE/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%2Bnesting_amd_viommu_202300621
[3] Base GitHub branch from Yi Liu
    https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-06192023-yi
[4] AMD IOMMU Specification
    https://www.amd.com/system/files/TechDocs/48882_3.07_PUB.pdf
[5] KVM Forum 2020 Presentation
    https://tinyurl.com/2p8b543c
[6] KVM Forum 2021 Presentation
    https://tinyurl.com/49sy42ry

Thank you,
Suravee Suthikulpanit

Suravee Suthikulpanit (21):
  iommu/amd: Declare helper functions as extern
  iommu/amd: Clean up spacing in amd_iommu_ops declaration
  iommu/amd: Update PASID, GATS, and GLX feature related macros
  iommu/amd: Modify domain_enable_v2() to add giov parameter
  iommu/amd: Refactor set_dte_entry() helper function
  iommu/amd: Modify set_dte_entry() to add gcr3 input parameter
  iommu/amd: Modify set_dte_entry() to add user domain input parameter
  iommu/amd: Allow nested IOMMU page tables
  iommu/amd: Add support for hw_info for iommu capability query
  iommu/amd: Introduce vIOMMU-specific events and event info
  iommu/amd: Introduce Reset vMMIO Command
  iommu/amd: Introduce AMD vIOMMU-specific UAPI
  iommu/amd: Introduce vIOMMU command-line option
  iommu/amd: Initialize vIOMMU private address space regions
  iommu/amd: Introduce vIOMMU vminit and vmdestroy ioctl
  iommu/amd: Introduce vIOMMU ioctl for updating device mapping table
  iommu/amd: Introduce vIOMMU ioctl for updating domain mapping
  iommu/amd: Introduce vIOMMU ioctl for handling guest MMIO accesses
  iommu/amd: Introduce vIOMMU ioctl for handling command buffer mapping
  iommu/amd: Introduce vIOMMU ioctl for setting up guest CR3
  iommufd: Introduce AMD HW-vIOMMU IOCTL

 drivers/iommu/amd/Makefile          |    2 +-
 drivers/iommu/amd/amd_iommu.h       |   40 +-
 drivers/iommu/amd/amd_iommu_types.h |   62 +-
 drivers/iommu/amd/amd_viommu.h      |   57 ++
 drivers/iommu/amd/init.c            |   29 +-
 drivers/iommu/amd/io_pgtable.c      |   18 +-
 drivers/iommu/amd/iommu.c           |  370 +++++++--
 drivers/iommu/amd/iommu_v2.c        |    2 +-
 drivers/iommu/amd/viommu.c          | 1110 +++++++++++++++++++++++++++
 drivers/iommu/iommufd/Makefile      |    3 +-
 drivers/iommu/iommufd/amd_viommu.c  |  158 ++++
 drivers/iommu/iommufd/main.c        |   17 +-
 include/linux/amd-viommu.h          |   26 +
 include/linux/iommu.h               |    1 +
 include/linux/iommufd.h             |    8 +
 include/uapi/linux/amd_viommu.h     |  145 ++++
 include/uapi/linux/iommufd.h        |   31 +
 17 files changed, 1964 insertions(+), 115 deletions(-)
 create mode 100644 drivers/iommu/amd/amd_viommu.h
 create mode 100644 drivers/iommu/amd/viommu.c
 create mode 100644 drivers/iommu/iommufd/amd_viommu.c
 create mode 100644 include/linux/amd-viommu.h
 create mode 100644 include/uapi/linux/amd_viommu.h
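For illustration only, a minimal sketch of the VMM-side flow described in the OVERVIEW above. None of this code is from the series; the helper names are hypothetical stand-ins for the IOMMUFD HWPT-allocation and attach ioctls that the series plugs into via struct iommu_ops.

/*
 * Illustrative sketch, not code from this series: conceptual VMM flow for
 * nested translation. Helper names are hypothetical stand-ins for the
 * IOMMUFD ioctl sequence.
 */
#include <stdint.h>

struct hwpt { uint32_t id; };	/* stand-in for an iommufd HWPT object ID */

/* Assumed helpers wrapping the HWPT allocation/attach ioctls. */
struct hwpt *alloc_stage2_hwpt(int iommufd, uint32_t ioas_id);
struct hwpt *alloc_stage1_hwpt(int iommufd, struct hwpt *parent,
			       uint64_t guest_table_gpa, uint32_t gdomid);
int attach_device_to_hwpt(int iommufd, uint32_t dev_id, struct hwpt *hwpt);

static int setup_nested_viommu(int iommufd, uint32_t ioas_id, uint32_t dev_id,
			       uint64_t guest_table_gpa, uint32_t gdomid)
{
	/* Stage 2 (v1): GPA -> SPA, owned by the hypervisor (KVM/VFIO). */
	struct hwpt *s2 = alloc_stage2_hwpt(iommufd, ioas_id);
	/* Stage 1 (v2): GIOVA -> GPA, owned by the guest IOMMU driver
	 * (e.g. a guest booted with amd_iommu=pgtable_v2), located from the
	 * trapped gDTE. */
	struct hwpt *s1 = alloc_stage1_hwpt(iommufd, s2, guest_table_gpa, gdomid);

	if (!s2 || !s1)
		return -1;
	/* Guest stage-1 invalidations are then handled by the HW-vIOMMU
	 * directly from the guest command buffer, without a hypervisor exit. */
	return attach_device_to_hwpt(iommufd, dev_id, s1);
}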
Comments
On Wed, Jun 21, 2023 at 06:54:47PM -0500, Suravee Suthikulpanit wrote:

> Since the IOMMU hardware virtualizes the guest command buffer, this allows IOMMU operations to be accelerated, such as invalidation of guest pages (i.e. stage1), when the command is issued by the guest kernel without intervention from the hypervisor.

This is similar to what we are doing on ARM as well.

> This series is implemented on top of the IOMMUFD framework. It leverages the existing APIs and ioctls for providing guest iommu information (i.e. struct iommu_hw_info_amd), and allowing the guest to provide guest page table information (i.e. struct iommu_hwpt_amd_v2) for setting up the user domain.
>
> Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.
>
> NOTES
> -----
> This series is organized into two parts:
> * Part 1: Preparing the IOMMU driver for HW-vIOMMU support (Patch 1-8).
>
> * Part 2: Introducing HW-vIOMMU support (Patch 9-21).
>
> * Patches 12 and 21 extend the existing IOMMUFD ioctls to support additional operations, which can be categorized into:
>   - Ioctls to init/destroy an AMD HW-vIOMMU instance
>   - Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
>   - Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.
>   - Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.
>   - Ioctls to trap AMD HW-vIOMMU command buffer initialization.

No one else seems to need this kind of stuff, why is AMD different?

Emulation and mediation to create the vIOMMU is supposed to be on the VMM side, not in the kernel. I don't want to see different models by vendor.

Even stuff like setting up the gcr3 should not be its own ioctls; that is not how we are modeling things at all.

I think you need to take smaller steps in line with the other drivers so we can all progress through this step by step together.

To start, focus only on user space page tables and kernel mediated invalidation and fit into the same model as everyone else. This is approx the same patches and uAPI you see for ARM and Intel. AFAICT AMD's HW is very similar to ARM's, so you should be aligning to the ARM design.

Then maybe we can argue if a kernel vIOMMU emulation/mediation is appropriate or not, but this series is just too much as is.

I also want to see the AMD driver align with the new APIs for PASID/etc before we start shoveling more stuff into it. This is going to be part of the iommufd contract as well, and I'm very unhappy to see drivers picking and choosing what part of the contract they implement.

Regards,
Jason
Jason,

On 6/22/2023 6:46 AM, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 06:54:47PM -0500, Suravee Suthikulpanit wrote:
>
>> Since the IOMMU hardware virtualizes the guest command buffer, this allows IOMMU operations to be accelerated, such as invalidation of guest pages (i.e. stage1), when the command is issued by the guest kernel without intervention from the hypervisor.
>
> This is similar to what we are doing on ARM as well.

Ok

>> This series is implemented on top of the IOMMUFD framework. It leverages the existing APIs and ioctls for providing guest iommu information (i.e. struct iommu_hw_info_amd), and allowing the guest to provide guest page table information (i.e. struct iommu_hwpt_amd_v2) for setting up the user domain.
>>
>> Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.
>>
>> NOTES
>> -----
>> This series is organized into two parts:
>> * Part 1: Preparing the IOMMU driver for HW-vIOMMU support (Patch 1-8).
>>
>> * Part 2: Introducing HW-vIOMMU support (Patch 9-21).
>>
>> * Patches 12 and 21 extend the existing IOMMUFD ioctls to support additional operations, which can be categorized into:
>>   - Ioctls to init/destroy an AMD HW-vIOMMU instance
>>   - Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
>>   - Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.
>>   - Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.
>>   - Ioctls to trap AMD HW-vIOMMU command buffer initialization.
>
> No one else seems to need this kind of stuff, why is AMD different?
>
> Emulation and mediation to create the vIOMMU is supposed to be on the VMM side, not in the kernel. I don't want to see different models by vendor.

These ioctls are not necessary for emulation, which I would agree should be done on the VMM side (e.g. QEMU). These ioctls provide the necessary information for programming the AMD IOMMU hardware to provide a hardware-assisted virtualized IOMMU. This includes programming certain data structures, i.e. the Domain ID mapping table (DomIDMap), the Device ID mapping table (DevIDMap), and certain MMIO registers for controlling the HW-vIOMMU feature.

> Even stuff like setting up the gcr3 should not be its own ioctls; that is not how we are modeling things at all.

Sorry for the miscommunication regarding the ioctl for setting up gcr3 in the commit log message for patch 20 and causing confusion. I'll update the message accordingly. Please allow me to clarify this briefly here.

In this series, the AMD IOMMU GCR3 table is actually set up when IOMMUFD_CMD_HWPT_ALLOC is called, for which the driver provides a hook via struct iommu_ops.domain_alloc_user(). The AMD-specific information is communicated from QEMU via iommu_domain_user_data.iommu_hwpt_amd_v2. This is similar to INTEL and ARM.

Please also note that for the AMD HW-vIOMMU device model in QEMU, the guest memory used for the IOMMU device table is trapped when the guest IOMMU driver programs the guest Device Table Entry (gDTE). Then QEMU reads the content of the gDTE to extract the necessary information for setting up the guest (stage-1) page table, and calls iommufd_backend_alloc_hwpt().

There is still work to be done here to fully support PASID. I'll take a look at this next.

> I think you need to take smaller steps in line with the other drivers so we can all progress through this step by step together.

I can certainly break down the patch series into smaller parts to align with the rest.

> To start, focus only on user space page tables and kernel mediated invalidation and fit into the same model as everyone else. This is approx the same patches and uAPI you see for ARM and Intel. AFAICT AMD's HW is very similar to ARM's, so you should be aligning to the ARM design.

I think the user space page table is covered as described above.

As for the kernel mediated invalidation, IIUC from looking at the patches:

* iommufd: Add nesting related data structures for ARM SMMUv3 (https://github.com/yiliu1765/iommufd/commit/b6a5c8991dcc96ca895b53175c93e5fc522f42fe)
* iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user (https://github.com/yiliu1765/iommufd/commit/0ae59149474ad0cb8a42ff7e71ed6b4e9df00204)

it seems that user space is supposed to call the ioctl IOMMUFD_CMD_HWPT_INVALIDATE for both INTEL and ARM to issue invalidation for the stage 1 page table. Please lemme know if I misunderstand the purpose of this ioctl.

However, for AMD, since the HW-vIOMMU virtualizes the guest command buffer, when it sees the page table invalidation command in the guest command buffer it takes care of the invalidation using information in the DomIDMap, which maps the guest domain ID (gDomID) of a particular guest to the corresponding host domain ID (hDomID) of the device, and invalidates the nested translation according to the specified PASID, DomID, and GVA.

The DomIDMap is set up by the host IOMMU driver during VM initialization. When the guest IOMMU driver attaches the VFIO pass-through device to a guest iommu_group (i.e. domain), it programs the gDTE with the gDomID. This action is trapped into QEMU, and the gDomID is read from the gDTE and communicated to the hypervisor via the newly proposed ioctl VIOMMU_DOMAIN_ATTACH. Now the DomIDMap is created for the VFIO device.

> Then maybe we can argue if a kernel vIOMMU emulation/mediation is appropriate or not, but this series is just too much as is.

Sure, we can continue to discuss the implementation details for each part separately.

> I also want to see the AMD driver align with the new APIs for PASID/etc before we start shoveling more stuff into it.

Are you referring to the IOMMU API for SVA/PASID stuff:
* struct iommu_domain_ops.set_dev_pasid()
* struct iommu_ops.remove_dev_pasid()
* ...

If so, we are working on it separately in parallel, and will be sending out an RFC soon. Otherwise, could you please point me to what "new APIs for PASID/etc" you are referring to in particular? I might have missed something here.

> This is going to be part of the iommufd contract as well, and I'm very unhappy to see drivers picking and choosing what part of the contract they implement.

Sorry, didn't mean to disappoint :) Lemme look into this part more and will try to be more compliant with the contract in the next RFC.

Thanks,
Suravee
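For illustration only, a sketch of the DomIDMap role described in the reply above: a per-guest table, indexed by guest DomainID, that the hardware consults when virtualizing guest invalidation commands. The entry layout and sizes are guesses for illustration, not the driver's actual definitions.

/*
 * Illustrative sketch, not the driver's real data structure: the gDomID ->
 * hDomID translation the HW-vIOMMU performs when it handles a guest
 * invalidation command from the guest command buffer.
 */
#include <stdint.h>

#define AMD_IOMMU_MAX_DOMID	(1 << 16)	/* DomainID is a 16-bit field */

struct domid_map_entry {			/* layout is illustrative only */
	uint16_t hdomid;			/* host DomainID backing this guest domain */
	uint16_t valid;
};

/* One table per guest vIOMMU instance, populated when the VMM reports the
 * gDomID it read from the trapped gDTE (VIOMMU_DOMAIN_ATTACH in this RFC). */
struct domid_map {
	struct domid_map_entry entry[AMD_IOMMU_MAX_DOMID];
};

/* Conceptually what the hardware does for a guest invalidation: translate
 * the guest DomainID, then invalidate the nested (stage-1) mapping for the
 * given PASID/GVA under the host DomainID. */
static inline uint16_t domid_map_lookup(const struct domid_map *map,
					uint16_t gdomid)
{
	return map->entry[gdomid].valid ? map->entry[gdomid].hdomid : 0;
}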
On Thu, Jun 22, 2023 at 06:15:17PM -0700, Suthikulpanit, Suravee wrote:
> Jason,
>
> On 6/22/2023 6:46 AM, Jason Gunthorpe wrote:
> > On Wed, Jun 21, 2023 at 06:54:47PM -0500, Suravee Suthikulpanit wrote:
> >
> > > Since the IOMMU hardware virtualizes the guest command buffer, this allows IOMMU operations to be accelerated, such as invalidation of guest pages (i.e. stage1), when the command is issued by the guest kernel without intervention from the hypervisor.
> >
> > This is similar to what we are doing on ARM as well.
>
> Ok
>
> > > This series is implemented on top of the IOMMUFD framework. It leverages the existing APIs and ioctls for providing guest iommu information (i.e. struct iommu_hw_info_amd), and allowing the guest to provide guest page table information (i.e. struct iommu_hwpt_amd_v2) for setting up the user domain.
> > >
> > > Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.
> > >
> > > NOTES
> > > -----
> > > This series is organized into two parts:
> > > * Part 1: Preparing the IOMMU driver for HW-vIOMMU support (Patch 1-8).
> > >
> > > * Part 2: Introducing HW-vIOMMU support (Patch 9-21).
> > >
> > > * Patches 12 and 21 extend the existing IOMMUFD ioctls to support additional operations, which can be categorized into:
> > >   - Ioctls to init/destroy an AMD HW-vIOMMU instance
> > >   - Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
> > >   - Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.
> > >   - Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.
> > >   - Ioctls to trap AMD HW-vIOMMU command buffer initialization.
> >
> > No one else seems to need this kind of stuff, why is AMD different?
> >
> > Emulation and mediation to create the vIOMMU is supposed to be on the VMM side, not in the kernel. I don't want to see different models by vendor.
>
> These ioctls are not necessary for emulation, which I would agree should be done on the VMM side (e.g. QEMU). These ioctls provide the necessary information for programming the AMD IOMMU hardware to provide a hardware-assisted virtualized IOMMU.

You have one called 'trap', it shouldn't be like this. It seems like this is trying to parse the command buffer in the kernel, it should be done in the VMM.

> In this series, the AMD IOMMU GCR3 table is actually set up when IOMMUFD_CMD_HWPT_ALLOC is called, for which the driver provides a hook via struct iommu_ops.domain_alloc_user().

That isn't entirely right either, the GCR3 should be programmed into HW during iommu_domain attach.

> The AMD-specific information is communicated from QEMU via iommu_domain_user_data.iommu_hwpt_amd_v2. This is similar to INTEL and ARM.

This is only for requesting the iommu_domain and supplying the gcr3 VA for later use.

> Please also note that for the AMD HW-vIOMMU device model in QEMU, the guest memory used for the IOMMU device table is trapped when the guest IOMMU driver programs the guest Device Table Entry (gDTE). Then QEMU reads the content of the gDTE to extract the necessary information for setting up the guest (stage-1) page table, and calls iommufd_backend_alloc_hwpt().

This is the same as ARM. It is a two step operation, you de-duplicate the gDTE entries (eg to share vDIDs), allocating a HWPT if it doesn't already exist, then you attach the HWPT to the physical device the gDTE's vRID implies.

> There is still work to be done here to fully support PASID. I'll take a look at this next.

I would expect PASID work is only about invalidation?

> > To start, focus only on user space page tables and kernel mediated invalidation and fit into the same model as everyone else. This is approx the same patches and uAPI you see for ARM and Intel. AFAICT AMD's HW is very similar to ARM's, so you should be aligning to the ARM design.
>
> I think the user space page table is covered as described above.

I'm not sure, it doesn't look like it is what I would expect.

> it seems that user space is supposed to call the ioctl IOMMUFD_CMD_HWPT_INVALIDATE for both INTEL and ARM to issue invalidation for the stage 1 page table. Please lemme know if I misunderstand the purpose of this ioctl.

Yes, the VMM traps the invalidation and issues it like this.

> However, for AMD, since the HW-vIOMMU virtualizes the guest command buffer, when it sees the page table invalidation command in the guest command buffer it takes care of the invalidation using information in the DomIDMap, which maps the guest domain ID (gDomID) of a particular guest to the corresponding host domain ID (hDomID) of the device, and invalidates the nested translation according to the specified PASID, DomID, and GVA.

The VMM should do all of this stuff. The VMM parses the command buffer and the VMM converts the commands to invalidation ioctls.

I'm unclear if AMD supports a mode where the HW can directly operate a command/invalidation queue in the VM without virtualization. Eg DMA from guest memory and deliver directly to the guest completion interrupts.

If it always needs SW then the SW part should be in the VMM, not the kernel. Then you don't need to load all these tables into the kernel.

> The DomIDMap is set up by the host IOMMU driver during VM initialization. When the guest IOMMU driver attaches the VFIO pass-through device to a guest iommu_group (i.e. domain), it programs the gDTE with the gDomID. This action is trapped into QEMU, and the gDomID is read from the gDTE and communicated to the hypervisor via the newly proposed ioctl VIOMMU_DOMAIN_ATTACH. Now the DomIDMap is created for the VFIO device.

The gDomID should be supplied when the HWPT is allocated, not via new ioctls.

> Are you referring to the IOMMU API for SVA/PASID stuff:
> * struct iommu_domain_ops.set_dev_pasid()
> * struct iommu_ops.remove_dev_pasid()
> * ...

Yes

> If so, we are working on it separately in parallel, and will be sending out an RFC soon.

Great

Jason
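For illustration only, a sketch of the VMM-mediated model argued for above: the VMM parses the guest command buffer and forwards invalidations through the generic HWPT path instead of relying on an in-kernel gDomID mapping table. The command layout is a simplified stand-in for AMD's INVALIDATE_IOMMU_PAGES command, and hwpt_invalidate() is a hypothetical wrapper around the (then still in-flight) IOMMUFD_CMD_HWPT_INVALIDATE uAPI, not a real function.

/*
 * Illustrative sketch, not code from this series or from the ARM/Intel
 * patches: VMM-side handling of a guest invalidation command.
 */
#include <stdint.h>

struct guest_inv_cmd {		/* simplified guest command, illustration only */
	uint16_t gdomid;	/* guest DomainID taken from the command */
	uint32_t pasid;
	uint64_t address;
	uint8_t  range;		/* range vs. single-page invalidation */
};

/* Hypothetical wrapper around the HWPT invalidation ioctl. */
int hwpt_invalidate(int iommufd, uint32_t hwpt_id, uint64_t addr,
		    uint8_t range, uint32_t pasid);

/* The VMM's own gDomID -> HWPT bookkeeping, built when the gDTE was trapped. */
uint32_t vmm_hwpt_for_gdomid(uint16_t gdomid);

static int vmm_handle_guest_invalidation(int iommufd, struct guest_inv_cmd *cmd)
{
	uint32_t hwpt_id = vmm_hwpt_for_gdomid(cmd->gdomid);

	/* No hardware gDomID->hDomID table is needed in this model: the VMM
	 * already knows which kernel HWPT backs the guest domain and issues
	 * the invalidation through it. */
	return hwpt_invalidate(iommufd, hwpt_id, cmd->address, cmd->range,
			       cmd->pasid);
}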
Jason,

On 6/23/2023 4:45 AM, Jason Gunthorpe wrote:
> On Thu, Jun 22, 2023 at 06:15:17PM -0700, Suthikulpanit, Suravee wrote:
>> Jason,
>>
>> On 6/22/2023 6:46 AM, Jason Gunthorpe wrote:
>>> On Wed, Jun 21, 2023 at 06:54:47PM -0500, Suravee Suthikulpanit wrote:
>>>
>>>> Since the IOMMU hardware virtualizes the guest command buffer, this allows IOMMU operations to be accelerated, such as invalidation of guest pages (i.e. stage1), when the command is issued by the guest kernel without intervention from the hypervisor.
>>>
>>> This is similar to what we are doing on ARM as well.
>>
>> Ok
>>
>>>> This series is implemented on top of the IOMMUFD framework. It leverages the existing APIs and ioctls for providing guest iommu information (i.e. struct iommu_hw_info_amd), and allowing the guest to provide guest page table information (i.e. struct iommu_hwpt_amd_v2) for setting up the user domain.
>>>>
>>>> Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.
>>>>
>>>> NOTES
>>>> -----
>>>> This series is organized into two parts:
>>>> * Part 1: Preparing the IOMMU driver for HW-vIOMMU support (Patch 1-8).
>>>>
>>>> * Part 2: Introducing HW-vIOMMU support (Patch 9-21).
>>>>
>>>> * Patches 12 and 21 extend the existing IOMMUFD ioctls to support additional operations, which can be categorized into:
>>>>   - Ioctls to init/destroy an AMD HW-vIOMMU instance
>>>>   - Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
>>>>   - Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.

I'm re-looking into these three a bit and will get back.

>>>>   - Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.

To describe the need for this ioctl: AMD IOMMU has two sets of MMIO registers:
1. Control MMIO
2. Data MMIO

For AMD HW-vIOMMU, the hardware defines a private memory address space (PAS) containing VF Control MMIO and VF MMIO registers for each guest IOMMU instance, which represents the guest view of the AMD IOMMU MMIO registers. This memory is also accessed by the IOMMU hardware to virtualize the guest MMIO registers.

When the guest IOMMU driver writes to a guest control MMIO register of the QEMU AMD HW-vIOMMU device model, it traps into QEMU. QEMU reads the value and calls VIOMMU_MMIO_ACCESS to tell the AMD IOMMU driver in the host to program the VFCtrlMMIO or VFMMIO register for this guest. Similarly, for a read of a guest control MMIO register, QEMU calls the ioctl to get the value from the AMD IOMMU driver, which reads the guest VFCtrlMMIO or VFMMIO register and provides it back to the guest.

>>>>   - Ioctls to trap AMD HW-vIOMMU command buffer initialization.

For this ioctl, the IOMMU hardware defines an IOMMU PAS containing a command buffer for each guest IOMMU instance. This memory is also accessed by the IOMMU hardware to virtualize the guest command buffer.

When the guest IOMMU driver writes to the guest Command Buffer Base Address MMIO register of the QEMU AMD HW-vIOMMU device model, it traps into QEMU. QEMU reads the value, parses the GPA, and translates it to an HVA. Then it calls VIOMMU_CMDBUF_UPDATE to communicate the HVA to the IOMMU driver to map it in the IOMMU PAS so that the hardware uses this memory to virtualize the guest command buffer.

>>>
>>> No one else seems to need this kind of stuff, why is AMD different?
>>>
>>> Emulation and mediation to create the vIOMMU is supposed to be on the VMM side, not in the kernel. I don't want to see different models by vendor.
>>
>> These ioctls are not necessary for emulation, which I would agree should be done on the VMM side (e.g. QEMU). These ioctls provide the necessary information for programming the AMD IOMMU hardware to provide a hardware-assisted virtualized IOMMU.
>
> You have one called 'trap', it shouldn't be like this. It seems like this is trying to parse the command buffer in the kernel, it should be done in the VMM.

Please see the more detailed description above. Basically, all parsing is done in the VMM, and it uses the ioctl to tell the IOMMU driver to program the VFCtrlMMIO/VFMMIO registers or the IOMMU PAS for the hardware to access.

>> In this series, the AMD IOMMU GCR3 table is actually set up when IOMMUFD_CMD_HWPT_ALLOC is called, for which the driver provides a hook via struct iommu_ops.domain_alloc_user().
>
> That isn't entirely right either, the GCR3 should be programmed into HW during iommu_domain attach.
>
>> The AMD-specific information is communicated from QEMU via iommu_domain_user_data.iommu_hwpt_amd_v2. This is similar to INTEL and ARM.
>
> This is only for requesting the iommu_domain and supplying the gcr3 VA for later use.

Ah, ok. Lemme look into this again and get back to you.

>....
>
>> There is still work to be done here to fully support PASID. I'll take a look at this next.
>
> I would expect PASID work is only about invalidation?

Actually, I am referring to supporting non-zero PASID, which requires walking the guest IOMMU gCR3 table and communicating this to the hypervisor.

>>> To start, focus only on user space page tables and kernel mediated invalidation and fit into the same model as everyone else. This is approx the same patches and uAPI you see for ARM and Intel. AFAICT AMD's HW is very similar to ARM's, so you should be aligning to the ARM design.
>>
>> I think the user space page table is covered as described above.
>
> I'm not sure, it doesn't look like it is what I would expect.

Lemme clean up this part and get back in the next RFC.

>> It seems that user space is supposed to call the ioctl IOMMUFD_CMD_HWPT_INVALIDATE for both INTEL and ARM to issue invalidation for the stage 1 page table. Please lemme know if I misunderstand the purpose of this ioctl.
>
> Yes, the VMM traps the invalidation and issues it like this.
>
>> However, for AMD, since the HW-vIOMMU virtualizes the guest command buffer, when it sees the page table invalidation command in the guest command buffer it takes care of the invalidation using information in the DomIDMap, which maps the guest domain ID (gDomID) of a particular guest to the corresponding host domain ID (hDomID) of the device, and invalidates the nested translation according to the specified PASID, DomID, and GVA.
>
> The VMM should do all of this stuff. The VMM parses the command buffer and the VMM converts the commands to invalidation ioctls.
>
> I'm unclear if AMD supports a mode where the HW can directly operate a command/invalidation queue in the VM without virtualization. Eg DMA from guest memory and deliver directly to the guest completion interrupts.

Correct, the VMM does not need to parse the command buffer. The hardware takes care of virtualizing the invalidation commands in the guest command buffer directly, without the VMM's help to do the invalidation from the host side.

For AMD IOMMU, the invalidation command is normally followed by a COMPLETION_WAIT command on a memory semaphore, which the hardware updates after all the prior commands are completed. For Linux, we are not using the Completion Wait interrupt. The IOMMU driver polls on the memory semaphore in a loop.

> If it always needs SW then the SW part should be in the VMM, not the kernel. Then you don't need to load all these tables into the kernel.

As described, the IOMMU driver needs to program the IOMMU PAS. The IOMMU hardware uses its own IOMMU page table to access the PAS.

For example, an AMD IOMMU is normally listed as a PCI device (e.g. PCI ID 00:00.2). To set up the IOMMU PAS for this IOMMU instance, the IOMMU driver allocates an IOMMU v1 page table for this device, which contains the PAS mapping. The IOMMU hardware uses the PAS for storing Guest IOMMU information such as Guest MMIOs, DevID Mapping Table, DomID Mapping Table, and Guest Command/Event/PPR logs.

Thanks,
Suravee
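For illustration only, a sketch of the MMIO trap path described above from the QEMU side. The struct and the request number are hypothetical placeholders; the actual VIOMMU_MMIO_ACCESS payload is defined by this series in include/uapi/linux/amd_viommu.h and may differ.

/*
 * Illustrative sketch, not code from this series or from QEMU: a device-model
 * handler forwarding a trapped guest vIOMMU MMIO access to the host driver.
 */
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define VIOMMU_MMIO_ACCESS	0	/* placeholder request number, not the real one */

struct viommu_mmio_access {		/* hypothetical payload */
	uint32_t gid;			/* guest vIOMMU instance ID */
	uint64_t offset;		/* offset into the guest MMIO view */
	uint64_t value;			/* value written, or read back on return */
	uint8_t  is_write;
};

/* Called by the device model when the guest touches a trapped MMIO register. */
static uint64_t viommu_mmio_trap(int iommufd, uint32_t gid, uint64_t offset,
				 uint64_t value, bool is_write)
{
	struct viommu_mmio_access req = {
		.gid = gid, .offset = offset, .value = value, .is_write = is_write,
	};

	/* Host driver applies the access to the per-guest VFCtrlMMIO/VFMMIO
	 * registers (write), or reads them back for the guest (read). */
	ioctl(iommufd, VIOMMU_MMIO_ACCESS, &req);
	return req.value;
}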
On Fri, Jun 23, 2023 at 03:05:06PM -0700, Suthikulpanit, Suravee wrote:

> For example, an AMD IOMMU is normally listed as a PCI device (e.g. PCI ID 00:00.2). To set up the IOMMU PAS for this IOMMU instance, the IOMMU driver allocates an IOMMU v1 page table for this device, which contains the PAS mapping.

So it is just system dram?

> The IOMMU hardware uses the PAS for storing Guest IOMMU information such as Guest MMIOs, DevID Mapping Table, DomID Mapping Table, and Guest Command/Event/PPR logs.

Why does it have to be in kernel memory?

Why not store the whole thing in user mapped memory and have the VMM manipulate it directly?

Jason
On 6/23/2023 3:56 PM, Jason Gunthorpe wrote:
> On Fri, Jun 23, 2023 at 03:05:06PM -0700, Suthikulpanit, Suravee wrote:
>
>> For example, an AMD IOMMU is normally listed as a PCI device (e.g. PCI ID 00:00.2). To set up the IOMMU PAS for this IOMMU instance, the IOMMU driver allocates an IOMMU v1 page table for this device, which contains the PAS mapping.
>
> So it is just system dram?

Yes, this is no different than the IOMMU page table for a particular device, containing mappings from IOMMU Private Address (IPA) to SPA. The IPA is defined in the IOMMU spec. Please see Figures 79 and 80 of this documentation for the IPA mapping used by the hardware.

https://www.amd.com/system/files/TechDocs/48882_3.07_PUB.pdf

>> The IOMMU hardware uses the PAS for storing Guest IOMMU information such as Guest MMIOs, DevID Mapping Table, DomID Mapping Table, and Guest Command/Event/PPR logs.
>
> Why does it have to be in kernel memory?
>
> Why not store the whole thing in user mapped memory and have the VMM manipulate it directly?

The Guest MMIO and CmdBuf Dirty Status are allocated per IOMMU instance, so these data structures cannot be allocated by the VMM. In this case, the IOMMUFD_CMD_MMIO_ACCESS might still be needed.

The DomID and DevID mapping tables are allocated per-VM:
* DomID Mapping Table (512 KB contiguous memory)
* DevID Mapping Table (1 MB contiguous memory)

Let's say we can use IOMMU_SET_DEV_DATA to communicate the memory address of the Dom/DevID Mapping tables to the IOMMU driver to pin and map in the PAS IOMMU page table. Then, this might work. Does that go along the lines of what you are thinking (mainly to try to avoid introducing additional ioctls)?

By the way, I think I can try getting rid of the IOMMUFD_CMD_CMDBUF_UPDATE. Lemme do that in the next RFC.

Thanks,
Suravee
On Fri, Jun 23, 2023 at 07:08:54PM -0700, Suthikulpanit, Suravee wrote:

> > > The IOMMU hardware uses the PAS for storing Guest IOMMU information such as Guest MMIOs, DevID Mapping Table, DomID Mapping Table, and Guest Command/Event/PPR logs.
> >
> > Why does it have to be in kernel memory?
> >
> > Why not store the whole thing in user mapped memory and have the VMM manipulate it directly?
>
> The Guest MMIO and CmdBuf Dirty Status are allocated per IOMMU instance, so these data structures cannot be allocated by the VMM.

Yes, it is unfortunate that so much stuff here wasn't 4k aligned so it could be mapped sensibly. It doesn't really make any sense to have a giant repeated register map that still has to be hypervisor trapped, a command queue would have been more logical :(

> In this case, the IOMMUFD_CMD_MMIO_ACCESS might still be needed.

It seems this is unavoidable, but it needs a clearer name and purpose.

But more importantly we don't really have any object to hang this off of - we don't have the notion of a "VM" in iommufd right now. We had sort of been handwaving that maybe the entire FD is a "VM" and maybe that works for some scenarios, but I don't think it works for what you need, especially if you consider multi-instance.

So, it is good that you brought this series right now, as I think it needs harmonizing with what ARM needs to do, and this is the more complex version of the two.

> The DomID and DevID mapping tables are allocated per-VM:
> * DomID Mapping Table (512 KB contiguous memory)
> * DevID Mapping Table (1 MB contiguous memory)

But these can be mapped into that IPA space at 4k granularity? They just need contiguous IOVA? So the VMM could provide this memory and we don't need calls to manipulate it?

> Let's say we can use IOMMU_SET_DEV_DATA to communicate the memory address of the Dom/DevID Mapping tables to the IOMMU driver to pin and map in the PAS IOMMU page table. Then, this might work. Does that go along the lines of what you are thinking (mainly to try to avoid introducing additional ioctls)?

I think it makes more sense if memory that is logically part of the VMM is mmap'd to the VMM. Since we have the general design of passing user pointers and pinning them it makes some sense.

You could do the same trick as your IPA space and use an IPA IOAS plus an access to set this all up.

This has the same issue as above, it needs some formal VM object, as fundamentally you are asking the driver to allocate a limited resource on a specific IOMMU instance and then link that to other actions.

Jason
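For illustration only, a user-space sketch of the direction suggested above, using just the existing IOMMU_IOAS_ALLOC/IOMMU_IOAS_MAP uAPI: the VMM allocates the per-VM tables itself and maps them at fixed IOVAs inside a dedicated "IPA" IOAS. The IOVA layout is made up, and whether the AMD driver/hardware could consume tables provided this way (and at 4k granularity) is exactly the open question in this exchange.

/*
 * Illustrative sketch, not code from this series: VMM-owned DomID/DevID
 * mapping tables placed into a dedicated IOAS via the existing iommufd uAPI.
 * The fixed IOVA values below are made up for illustration.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

#define DOMID_TABLE_SIZE	(512 * 1024)	/* per-VM DomID mapping table */
#define DEVID_TABLE_SIZE	(1024 * 1024)	/* per-VM DevID mapping table */
#define DOMID_TABLE_IPA		0x100000UL	/* hypothetical fixed IOVA */
#define DEVID_TABLE_IPA		0x200000UL	/* hypothetical fixed IOVA */

static int map_table(int iommufd, uint32_t ioas_id, uint64_t iova, size_t len)
{
	void *buf;

	/* VMM-owned, page-aligned backing memory for the table. */
	if (posix_memalign(&buf, 4096, len))
		return -1;
	memset(buf, 0, len);

	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.user_va = (uintptr_t)buf,
		.length = len,
		.iova = iova,
	};
	return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
}

int setup_ipa_ioas(int iommufd)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };

	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;
	if (map_table(iommufd, alloc.out_ioas_id, DOMID_TABLE_IPA, DOMID_TABLE_SIZE) ||
	    map_table(iommufd, alloc.out_ioas_id, DEVID_TABLE_IPA, DEVID_TABLE_SIZE))
		return -1;
	/* The AMD driver would still need a way (the "formal VM object"
	 * discussed above) to learn this IOAS and its IOVA layout. */
	return alloc.out_ioas_id;
}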