Message ID | 20230110143137.54517-2-suravee.suthikulpanit@amd.com |
---|---|
State | New |
From | Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
To | linux-kernel@vger.kernel.org, iommu@lists.linux.dev
Cc | joro@8bytes.org, robin.murphy@arm.com, ashish.kalra@amd.com, thomas.lendacky@amd.com, vasant.hegde@amd.com, jon.grimm@amd.com, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject | [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO
Date | Tue, 10 Jan 2023 08:31:34 -0600
Series | iommu/amd: Force SNP-enabled VFIO domain to 4K page size |
Commit Message
Suravee Suthikulpanit
Jan. 10, 2023, 2:31 p.m. UTC
Currently, to detect if a domain is enabled with VFIO support, the driver
checks whether the domain has devices attached and whether the domain
type is IOMMU_DOMAIN_UNMANAGED.

To make this explicit, introduce the protection-domain flag PD_VFIO_MASK
to mark a domain as VFIO-enabled.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/amd_iommu_types.h | 1 +
drivers/iommu/amd/iommu.c | 7 +++++--
2 files changed, 6 insertions(+), 2 deletions(-)
Comments
Hi Suravee,

I love your patch! Perhaps something to improve:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on linus/master v6.2-rc3 next-20230110]
[cannot apply to joro-iommu/next awilliam-vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suravee-Suthikulpanit/iommu-amd-Introduce-Protection-domain-flag-VFIO/20230110-223527
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230110143137.54517-2-suravee.suthikulpanit%40amd.com
patch subject: [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO
config: x86_64-defconfig
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/ea80bf7e918395d3445c0114fee20997512a4ccf
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Suravee-Suthikulpanit/iommu-amd-Introduce-Protection-domain-flag-VFIO/20230110-223527
        git checkout ea80bf7e918395d3445c0114fee20997512a4ccf
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/iommu/amd/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/iommu/amd/iommu.c: In function 'amd_iommu_detach_device':
>> drivers/iommu/amd/iommu.c:2136:35: warning: unused variable 'domain' [-Wunused-variable]
    2136 |         struct protection_domain *domain = to_pdomain(dom);
         |                                   ^~~~~~


vim +/domain +2136 drivers/iommu/amd/iommu.c

  2131	
  2132	static void amd_iommu_detach_device(struct iommu_domain *dom,
  2133						    struct device *dev)
  2134	{
  2135		struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
> 2136		struct protection_domain *domain = to_pdomain(dom);
  2137		struct amd_iommu *iommu;
  2138	
  2139		if (!check_device(dev))
  2140			return;
  2141	
  2142		if (dev_data->domain != NULL)
  2143			detach_device(dev);
  2144	
  2145		iommu = rlookup_amd_iommu(dev);
  2146		if (!iommu)
  2147			return;
  2148	
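The warning is straightforward: with CONFIG_IRQ_REMAP disabled, the new
'domain' variable in amd_iommu_detach_device() has no user, since its only
reference sits inside the #ifdef CONFIG_IRQ_REMAP block further down. A
minimal sketch of one possible fix (an editorial suggestion, not part of
the posted series):

static void amd_iommu_detach_device(struct iommu_domain *dom,
				    struct device *dev)
{
	struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
	/* Only dereferenced under CONFIG_IRQ_REMAP below; the attribute
	 * keeps W=1 builds quiet when IRQ_REMAP is disabled. */
	struct protection_domain *domain __maybe_unused = to_pdomain(dom);
	struct amd_iommu *iommu;

	/* ... remainder of the function unchanged ... */
}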
On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> Currently, to detect if a domain is enabled with VFIO support, the driver
> checks whether the domain has devices attached and whether the domain
> type is IOMMU_DOMAIN_UNMANAGED.

NAK

If you need weird HW specific stuff like this then please implement it
properly in iommufd, not try and randomly guess what things need from
the domain type.

All this confidential computing stuff needs a comprehensive solution,
not some piecemeal mess. How can you even use a CC guest with VFIO in
the upstream kernel? Hmm?

Jason
Hello Jason,

On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>> Currently, to detect if a domain is enabled with VFIO support, the driver
>> checks whether the domain has devices attached and whether the domain
>> type is IOMMU_DOMAIN_UNMANAGED.
>
> NAK
>
> If you need weird HW specific stuff like this then please implement it
> properly in iommufd, not try and randomly guess what things need from
> the domain type.
>
> All this confidential computing stuff needs a comprehensive solution,
> not some piecemeal mess. How can you even use a CC guest with VFIO in
> the upstream kernel? Hmm?
>

Currently all guest devices are untrusted - whether they are emulated,
virtio or passthrough. In the current use case of VFIO device-passthrough
to an SNP guest, the pass-through device will perform DMA to un-encrypted
or shared guest memory, in the same way as virtio or emulated devices.

This fix is prompted by an issue reported by Nvidia, who are trying to do
PCIe device passthrough to an SNP guest. The memory allocated for DMA is
through dma_alloc_coherent() in the SNP guest, and during DMA I/O an
RMP_PAGE_FAULT is observed on the host.

These dma_alloc_coherent() calls map into page state change hypercalls
into the host to change guest page state from encrypted to shared in the
RMP table.

Following is a link to the issue discussed above:
https://github.com/AMDESE/AMDSEV/issues/109

Now, to set individual 4K entries to different shared/private mappings in
NPT or host page tables for large page entries, the RMP and NPT/host page
table large page entries are split to 4K PTEs. The same split is required
in IOMMU page table entries to remain in sync with the corresponding RMP
table entry. If the IOMMU entry covers a range with a large page entry
and the individual 4K mappings in that range have now changed, the RMP
checks during the IOMMU page table walk will cause an RMP page fault to
be signaled.

The fix is to force 4K page size for IOMMU page tables for SNP guests.

This patch-set adds support to detect if a domain belongs to an
SNP-enabled guest. This way it can set the default page size of a domain
to 4K only for an SNP-enabled guest and allow non-SNP guests to use
larger page sizes.

Hopefully, this explains the use case for this patch-set.

Thanks,
Ashish
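To make the guest-side path concrete, a minimal sketch (editorial, not
code from the series; assumes a standard SNP guest kernel where the
private-to-shared conversion happens inside the DMA/SEV machinery rather
than in the driver):

#include <linux/dma-mapping.h>

/* In an SNP guest, dma_alloc_coherent() transparently ends up calling
 * set_memory_decrypted() on the allocated pages; on SNP that path issues
 * page-state-change (PSC) requests so the host flips the RMP entries for
 * those pages from private (encrypted) to shared. The driver itself does
 * nothing SNP-specific. */
static void *alloc_shared_dma_buffer(struct device *dev, size_t size,
				     dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
}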
On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
> Hello Jason,
>
> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> > On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> > > Currently, to detect if a domain is enabled with VFIO support, the driver
> > > checks whether the domain has devices attached and whether the domain
> > > type is IOMMU_DOMAIN_UNMANAGED.
> >
> > NAK
> >
> > If you need weird HW specific stuff like this then please implement it
> > properly in iommufd, not try and randomly guess what things need from
> > the domain type.
> >
> > All this confidential computing stuff needs a comprehensive solution,
> > not some piecemeal mess. How can you even use a CC guest with VFIO in
> > the upstream kernel? Hmm?
> >
>
> Currently all guest devices are untrusted - whether they are emulated,
> virtio or passthrough. In the current use case of VFIO device-passthrough
> to an SNP guest, the pass-through device will perform DMA to un-encrypted
> or shared guest memory, in the same way as virtio or emulated devices.
>
> This fix is prompted by an issue reported by Nvidia, who are trying to do
> PCIe device passthrough to an SNP guest. The memory allocated for DMA is
> through dma_alloc_coherent() in the SNP guest, and during DMA I/O an
> RMP_PAGE_FAULT is observed on the host.
>
> These dma_alloc_coherent() calls map into page state change hypercalls
> into the host to change guest page state from encrypted to shared in the
> RMP table.
>
> Following is a link to the issue discussed above:
> https://github.com/AMDESE/AMDSEV/issues/109

Wow you should really write all of this in the commit message

> Now, to set individual 4K entries to different shared/private mappings in
> NPT or host page tables for large page entries, the RMP and NPT/host page
> table large page entries are split to 4K PTEs.

Why are mappings to private pages even in the iommu in the first
place - and how did they even get there?

I thought the design for the private memory was walling it off in a
memfd and making it un-gup'able?

This seems to be your actual problem, somehow the iommu is being
loaded with private memory PFNs instead of only being loaded with
shared PFNs when shared mappings are created?

If the IOMMU mappings actually only extend to the legitimate shared
pages then you don't have a problem with large IOPTEs spanning a
mixture of page types.

> The fix is to force 4K page size for IOMMU page tables for SNP guests.

But even if you want to pursue this as the fix, it should not be done
in this way.

> This patch-set adds support to detect if a domain belongs to an
> SNP-enabled guest. This way it can set the default page size of a domain
> to 4K only for an SNP-enabled guest and allow non-SNP guests to use
> larger page sizes.

As I said, the KVM has nothing to do with the iommu and I want to
largely keep it that way.

If the VMM needs to request a 4k page size only iommu_domain because
it is somehow mapping mixtures of private and public pages, then the
VMM knows it is doing this crazy thing and it needs to ask iommufd
directly for customized iommu_domain from the driver.

No KVM interconnection.

In fact, we already have a way to do this in iommufd generically, have
the VMM set IOMMU_OPTION_HUGE_PAGES = 0.

Jason
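For reference, the knob Jason mentions is part of the iommufd uAPI merged
in v6.2 (field names per include/uapi/linux/iommufd.h). A minimal
userspace sketch, with error handling elided, of a VMM forcing 4K-only
IOPTEs on an IOAS:

#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Ask iommufd to map this IOAS with 4k IOPTEs only: val64 = 0 disables
 * huge pages (1 is the default), object_id names the target IOAS. */
static int ioas_disable_huge_pages(int iommufd, __u32 ioas_id)
{
	struct iommu_option opt = {
		.size = sizeof(opt),
		.option_id = IOMMU_OPTION_HUGE_PAGES,
		.op = IOMMU_OPTION_OP_SET,
		.object_id = ioas_id,
		.val64 = 0,
	};

	return ioctl(iommufd, IOMMU_OPTION, &opt);
}

Here iommufd would be the fd from open("/dev/iommu", O_RDWR) and ioas_id
an IOAS previously allocated with IOMMU_IOAS_ALLOC.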
On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
> On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
>> Hello Jason,
>>
>> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
>>> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>>>> Currently, to detect if a domain is enabled with VFIO support, the driver
>>>> checks whether the domain has devices attached and whether the domain
>>>> type is IOMMU_DOMAIN_UNMANAGED.
>>>
>>> NAK
>>>
>>> If you need weird HW specific stuff like this then please implement it
>>> properly in iommufd, not try and randomly guess what things need from
>>> the domain type.
>>>
>>> All this confidential computing stuff needs a comprehensive solution,
>>> not some piecemeal mess. How can you even use a CC guest with VFIO in
>>> the upstream kernel? Hmm?
>>>
>>
>> Currently all guest devices are untrusted - whether they are emulated,
>> virtio or passthrough. In the current use case of VFIO device-passthrough
>> to an SNP guest, the pass-through device will perform DMA to un-encrypted
>> or shared guest memory, in the same way as virtio or emulated devices.
>>
>> This fix is prompted by an issue reported by Nvidia, who are trying to do
>> PCIe device passthrough to an SNP guest. The memory allocated for DMA is
>> through dma_alloc_coherent() in the SNP guest, and during DMA I/O an
>> RMP_PAGE_FAULT is observed on the host.
>>
>> These dma_alloc_coherent() calls map into page state change hypercalls
>> into the host to change guest page state from encrypted to shared in the
>> RMP table.
>>
>> Following is a link to the issue discussed above:
>> https://github.com/AMDESE/AMDSEV/issues/109
>
> Wow you should really write all of this in the commit message
>
>> Now, to set individual 4K entries to different shared/private mappings in
>> NPT or host page tables for large page entries, the RMP and NPT/host page
>> table large page entries are split to 4K PTEs.
>
> Why are mappings to private pages even in the iommu in the first
> place - and how did they even get there?
>

You seem to be confusing host/NPT page tables with IOMMU page tables.
There are no private page mappings in the IOMMU page tables; as I
mentioned above, currently all DMA to an SNP guest is to/from shared
memory.

> I thought the design for the private memory was walling it off in a
> memfd and making it un-gup'able?
>
> This seems to be your actual problem, somehow the iommu is being
> loaded with private memory PFNs instead of only being loaded with
> shared PFNs when shared mappings are created?
>

The IOMMU page tables are loaded with shared PFNs and not private memory
PFNs.

> If the IOMMU mappings actually only extend to the legitimate shared
> pages then you don't have a problem with large IOPTEs spanning a
> mixture of page types.
>
>> The fix is to force 4K page size for IOMMU page tables for SNP guests.
>
> But even if you want to pursue this as the fix, it should not be done
> in this way.
>
>> This patch-set adds support to detect if a domain belongs to an
>> SNP-enabled guest. This way it can set the default page size of a domain
>> to 4K only for an SNP-enabled guest and allow non-SNP guests to use
>> larger page sizes.
>
> As I said, the KVM has nothing to do with the iommu and I want to
> largely keep it that way.
>
> If the VMM needs to request a 4k page size only iommu_domain because
> it is somehow mapping mixtures of private and public pages,

Again, there is no mixture of private and public pages; the IOMMU only
has shared page mappings.

Thanks,
Ashish

> then the
> VMM knows it is doing this crazy thing and it needs to ask iommufd
> directly for customized iommu_domain from the driver.
>
> No KVM interconnection.
>
> In fact, we already have a way to do this in iommufd generically, have
> the VMM set IOMMU_OPTION_HUGE_PAGES = 0.
>
> Jason
>
On Fri, Jan 20, 2023 at 09:12:26AM -0600, Kalra, Ashish wrote:
> On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
> > On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
> > > Hello Jason,
> > >
> > > On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> > > > On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> > > > > Currently, to detect if a domain is enabled with VFIO support, the driver
> > > > > checks whether the domain has devices attached and whether the domain
> > > > > type is IOMMU_DOMAIN_UNMANAGED.
> > > >
> > > > NAK
> > > >
> > > > If you need weird HW specific stuff like this then please implement it
> > > > properly in iommufd, not try and randomly guess what things need from
> > > > the domain type.
> > > >
> > > > All this confidential computing stuff needs a comprehensive solution,
> > > > not some piecemeal mess. How can you even use a CC guest with VFIO in
> > > > the upstream kernel? Hmm?
> > > >
> > >
> > > Currently all guest devices are untrusted - whether they are emulated,
> > > virtio or passthrough. In the current use case of VFIO device-passthrough
> > > to an SNP guest, the pass-through device will perform DMA to un-encrypted
> > > or shared guest memory, in the same way as virtio or emulated devices.
> > >
> > > This fix is prompted by an issue reported by Nvidia, who are trying to do
> > > PCIe device passthrough to an SNP guest. The memory allocated for DMA is
> > > through dma_alloc_coherent() in the SNP guest, and during DMA I/O an
> > > RMP_PAGE_FAULT is observed on the host.
> > >
> > > These dma_alloc_coherent() calls map into page state change hypercalls
> > > into the host to change guest page state from encrypted to shared in the
> > > RMP table.
> > >
> > > Following is a link to the issue discussed above:
> > > https://github.com/AMDESE/AMDSEV/issues/109
> >
> > Wow you should really write all of this in the commit message
> >
> > > Now, to set individual 4K entries to different shared/private mappings in
> > > NPT or host page tables for large page entries, the RMP and NPT/host page
> > > table large page entries are split to 4K PTEs.
> >
> > Why are mappings to private pages even in the iommu in the first
> > place - and how did they even get there?
> >
>
> You seem to be confusing host/NPT page tables with IOMMU page tables.

No, I haven't. I'm repeating what was said:

   during DMA I/O an RMP_PAGE_FAULT is observed on the host.

So, I'm interested to hear how you can get a RMP_PAGE_FAULT from the
IOMMU if the IOMMU is only programmed with shared pages that, by (my)
definition, are accessible to the CPU and should not generate a
RMP_PAGE_FAULT?

I think you are confusing my use of the word private with some AMD
architecture details. When I say private I mean that the host CPU will
generate a violation if it tries to access the memory.

I think the conclusion is logical - if the IOMMU is experiencing a
protection violation it is because the IOMMU was programmed with PFNs
it is not allowed to access - and so why was that even done in the
first place?

I suppose what is going on is you program the IOPTEs with PFNs of
unknown state and when the PFN changes access protections the IOMMU
can simply use it without needing to synchronize with the access
protection change. And your problem is that the granularity of access
protection change does not match the IOPTE granularity in the IOMMU.

But this seems very wasteful as the IOMMU will be using IOPTEs and
also will pin the memory when the system *knows* this memory cannot
be accessed through the IOMMU. It seems much better to dynamically
establish IOMMU mappings only when you learn that the memory is
actually accessible to the IOMMU.

Also, I thought the leading plan for CC was to use the memfd approach here:

https://lore.kernel.org/kvm/20220915142913.2213336-1-chao.p.peng@linux.intel.com/

Which prevents mmapping the memory to userspace - so how did it get
into the IOMMU in the first place?

Jason
Hello Jason,

On 1/20/2023 10:13 AM, Jason Gunthorpe wrote:
> On Fri, Jan 20, 2023 at 09:12:26AM -0600, Kalra, Ashish wrote:
>> On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
>>> On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
>>>> Hello Jason,
>>>>
>>>> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>>>>>> Currently, to detect if a domain is enabled with VFIO support, the driver
>>>>>> checks whether the domain has devices attached and whether the domain
>>>>>> type is IOMMU_DOMAIN_UNMANAGED.
>>>>>
>>>>> NAK
>>>>>
>>>>> If you need weird HW specific stuff like this then please implement it
>>>>> properly in iommufd, not try and randomly guess what things need from
>>>>> the domain type.
>>>>>
>>>>> All this confidential computing stuff needs a comprehensive solution,
>>>>> not some piecemeal mess. How can you even use a CC guest with VFIO in
>>>>> the upstream kernel? Hmm?
>>>>>
>>>>
>>>> Currently all guest devices are untrusted - whether they are emulated,
>>>> virtio or passthrough. In the current use case of VFIO device-passthrough
>>>> to an SNP guest, the pass-through device will perform DMA to un-encrypted
>>>> or shared guest memory, in the same way as virtio or emulated devices.
>>>>
>>>> This fix is prompted by an issue reported by Nvidia, who are trying to do
>>>> PCIe device passthrough to an SNP guest. The memory allocated for DMA is
>>>> through dma_alloc_coherent() in the SNP guest, and during DMA I/O an
>>>> RMP_PAGE_FAULT is observed on the host.
>>>>
>>>> These dma_alloc_coherent() calls map into page state change hypercalls
>>>> into the host to change guest page state from encrypted to shared in the
>>>> RMP table.
>>>>
>>>> Following is a link to the issue discussed above:
>>>> https://github.com/AMDESE/AMDSEV/issues/109
>>>
>>> Wow you should really write all of this in the commit message
>>>
>>>> Now, to set individual 4K entries to different shared/private mappings in
>>>> NPT or host page tables for large page entries, the RMP and NPT/host page
>>>> table large page entries are split to 4K PTEs.
>>>
>>> Why are mappings to private pages even in the iommu in the first
>>> place - and how did they even get there?
>>>
>>
>> You seem to be confusing host/NPT page tables with IOMMU page tables.
>
> No, I haven't. I'm repeating what was said:
>
>    during DMA I/O an RMP_PAGE_FAULT is observed on the host.
>
> So, I'm interested to hear how you can get a RMP_PAGE_FAULT from the
> IOMMU if the IOMMU is only programmed with shared pages that, by (my)
> definition, are accessible to the CPU and should not generate a
> RMP_PAGE_FAULT?

Yes, sorry, I got confused with your use of the word private as you
mention below.

We basically get the RMP #PF from the IOMMU because there is a page size
mismatch between the RMP table and the IOMMU page table. The RMP table's
large page entry has been smashed to 4K PTEs to handle page state change
to shared on 4K mappings, so this change has to be synced up with the
IOMMU page table; otherwise there is now a page size mismatch between the
RMP table and the IOMMU page table, which causes the RMP #PF.

Thanks,
Ashish

> I think you are confusing my use of the word private with some AMD
> architecture details. When I say private I mean that the host CPU will
> generate a violation if it tries to access the memory.
>
> I think the conclusion is logical - if the IOMMU is experiencing a
> protection violation it is because the IOMMU was programmed with PFNs
> it is not allowed to access - and so why was that even done in the
> first place?
>
> I suppose what is going on is you program the IOPTEs with PFNs of
> unknown state and when the PFN changes access protections the IOMMU
> can simply use it without needing to synchronize with the access
> protection change. And your problem is that the granularity of access
> protection change does not match the IOPTE granularity in the IOMMU.
>
> But this seems very wasteful as the IOMMU will be using IOPTEs and
> also will pin the memory when the system *knows* this memory cannot
> be accessed through the IOMMU. It seems much better to dynamically
> establish IOMMU mappings only when you learn that the memory is
> actually accessible to the IOMMU.
>
> Also, I thought the leading plan for CC was to use the memfd approach here:
>
> https://lore.kernel.org/kvm/20220915142913.2213336-1-chao.p.peng@linux.intel.com/
>
> Which prevents mmapping the memory to userspace - so how did it get
> into the IOMMU in the first place?
>
> Jason
>
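As a concrete illustration of the mismatch Ashish describes (an editorial
example with made-up numbers): suppose a 2M guest range is covered by a
single 2M IOPTE and a matching 2M RMP entry. If the guest converts just
one 4K page in that range to shared, the host smashes the 2M RMP entry
into 512 4K entries; the IOMMU walk then hits a 2M IOPTE whose RMP
coverage is now 4K-granular, and that size mismatch is what raises the
RMP #PF, even though the page actually being accessed is shared.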
On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:

> We basically get the RMP #PF from the IOMMU because there is a page size
> mismatch between the RMP table and the IOMMU page table. The RMP table's
> large page entry has been smashed to 4K PTEs to handle page state change
> to shared on 4K mappings, so this change has to be synced up with the
> IOMMU page table; otherwise there is now a page size mismatch between the
> RMP table and the IOMMU page table, which causes the RMP #PF.

I understand that, you haven't answered my question:

Why is the IOMMU being programmed with pages it cannot access in the
first place?

Don't do that is the obvious solution there, and preserves huge page
IO performance.

Jason
On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
> On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
>
>> We basically get the RMP #PF from the IOMMU because there is a page size
>> mismatch between the RMP table and the IOMMU page table. The RMP table's
>> large page entry has been smashed to 4K PTEs to handle page state change
>> to shared on 4K mappings, so this change has to be synced up with the
>> IOMMU page table; otherwise there is now a page size mismatch between the
>> RMP table and the IOMMU page table, which causes the RMP #PF.
>
> I understand that, you haven't answered my question:
>
> Why is the IOMMU being programmed with pages it cannot access in the
> first place?
>

I believe the IOMMU page tables are set up as part of device pass-through
to be able to do DMA to all of the guest memory, but I am not an IOMMU
expert, so I will let Suravee elaborate on this.

Thanks,
Ashish

> Don't do that is the obvious solution there, and preserves huge page
> IO performance.
>
> Jason
>
On 1/20/23 13:55, Kalra, Ashish wrote:
> On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
>> On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
>>
>>> We basically get the RMP #PF from the IOMMU because there is a page size
>>> mismatch between the RMP table and the IOMMU page table. The RMP table's
>>> large page entry has been smashed to 4K PTEs to handle page state change
>>> to shared on 4K mappings, so this change has to be synced up with the
>>> IOMMU page table; otherwise there is now a page size mismatch between the
>>> RMP table and the IOMMU page table, which causes the RMP #PF.
>>
>> I understand that, you haven't answered my question:
>>
>> Why is the IOMMU being programmed with pages it cannot access in the
>> first place?
>>
>
> I believe the IOMMU page tables are set up as part of device pass-through
> to be able to do DMA to all of the guest memory, but I am not an IOMMU
> expert, so I will let Suravee elaborate on this.

Right. And what I believe Jason is saying is, that for SNP, since we know
we can only DMA into shared pages, there is no need to set up the initial
IOMMU page tables for all of guest memory. Instead, wait and set up IOMMU
mappings when we do a page state change to shared and remove any mappings
when we do a page state change to private.

Thanks,
Tom

>
> Thanks,
> Ashish
>
>> Don't do that is the obvious solution there, and preserves huge page
>> IO performance.
>>
>> Jason
>>
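Sketched in code, the flow Tom describes could look roughly like the
following (editorial sketch; the psc_* handler names are hypothetical,
while iommu_map()/iommu_unmap() are the real kernel APIs, shown with
their v6.2-era signatures):

#include <linux/iommu.h>

/* Hypothetical hooks driven by the guest's page-state-change requests:
 * only pages the guest has made shared are ever mapped into the
 * passthrough domain, so the IOMMU never holds a PFN that could raise
 * an RMP #PF, and huge IOPTEs remain possible for shared ranges. */
static int psc_page_now_shared(struct iommu_domain *dom,
			       unsigned long iova, phys_addr_t paddr)
{
	return iommu_map(dom, iova, paddr, PAGE_SIZE,
			 IOMMU_READ | IOMMU_WRITE);
}

static void psc_page_now_private(struct iommu_domain *dom,
				 unsigned long iova)
{
	iommu_unmap(dom, iova, PAGE_SIZE);
}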
On Fri, Jan 20, 2023 at 04:42:26PM -0600, Tom Lendacky wrote:
> On 1/20/23 13:55, Kalra, Ashish wrote:
> > On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
> > > On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
> > >
> > > > We basically get the RMP #PF from the IOMMU because there is a page size
> > > > mismatch between the RMP table and the IOMMU page table. The RMP table's
> > > > large page entry has been smashed to 4K PTEs to handle page state change
> > > > to shared on 4K mappings, so this change has to be synced up with the
> > > > IOMMU page table; otherwise there is now a page size mismatch between the
> > > > RMP table and the IOMMU page table, which causes the RMP #PF.
> > >
> > > I understand that, you haven't answered my question:
> > >
> > > Why is the IOMMU being programmed with pages it cannot access in the
> > > first place?
> > >
> >
> > I believe the IOMMU page tables are set up as part of device pass-through
> > to be able to do DMA to all of the guest memory, but I am not an IOMMU
> > expert, so I will let Suravee elaborate on this.
>
> Right. And what I believe Jason is saying is, that for SNP, since we know
> we can only DMA into shared pages, there is no need to set up the initial
> IOMMU page tables for all of guest memory. Instead, wait and set up IOMMU
> mappings when we do a page state change to shared and remove any mappings
> when we do a page state change to private.

Correct. I don't know the details of how the shared/private works on
AMD, eg if the hypervisor even knows of the private/shared
transformation.

At the very worst I suppose if you just turn on the vIOMMU it should
just start working as the vIOMMU mode should make the paging dynamic.
eg virtio-iommu or something general might even do the job.

Pinning pages is expensive, breaks swap, defragmentation and wastes a
lot of iommu memory. Given that the bounce buffers really shouldn't be
reallocated constantly I'd expect vIOMMU to be performance OK.

This solves the page size mismatch issue because the iommu never has a
PFN installed that would generate a RMP #PF. The iommu can continue to
use large page sizes whenever possible.

It seems to me the current approach of just stuffing all memory into
the iommu is just a shortcut to get something working.

Otherwise, what I said before is still the case: only the VMM knows
what it is doing. It knows if it is using a model where it programs
private memory into the IOMMU so it knows if it should ask the kernel
to use a 4k iommu page size.

Trying to have the kernel guess what userspace is doing based on the
kvm is simply architecturally wrong.

Jason
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 1d0a70c85333..ad124959a26a 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -439,6 +439,7 @@
 					      translation */
 #define PD_IOMMUV2_MASK		(1UL << 3) /* domain has gcr3 table */
 #define PD_GIOV_MASK		(1UL << 4) /* domain enable GIOV support */
+#define PD_VFIO_MASK		(1UL << 5) /* domain enable VFIO support */
 
 extern bool amd_iommu_dump;
 #define DUMP_printk(format, arg...)					\
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3847f3bdc568..681ab1fdb7d5 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2056,6 +2056,8 @@ static struct protection_domain *protection_domain_alloc(unsigned int type)
 		mode = PAGE_MODE_NONE;
 	} else if (type == IOMMU_DOMAIN_UNMANAGED) {
 		pgtable = AMD_IOMMU_V1;
+		/* Mark unmanaged domain for VFIO */
+		domain->flags |= PD_VFIO_MASK;
 	}
 
 	switch (pgtable) {
@@ -2130,6 +2132,7 @@ static void amd_iommu_detach_device(struct iommu_domain *dom,
 					    struct device *dev)
 {
 	struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
+	struct protection_domain *domain = to_pdomain(dom);
 	struct amd_iommu *iommu;
 
 	if (!check_device(dev))
@@ -2144,7 +2147,7 @@ static void amd_iommu_detach_device(struct iommu_domain *dom,
 
 #ifdef CONFIG_IRQ_REMAP
 	if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) &&
-	    (dom->type == IOMMU_DOMAIN_UNMANAGED))
+	    (domain->flags & PD_VFIO_MASK))
 		dev_data->use_vapic = 0;
 #endif
 
@@ -2176,7 +2179,7 @@ static int amd_iommu_attach_device(struct iommu_domain *dom,
 
 #ifdef CONFIG_IRQ_REMAP
 	if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)) {
-		if (dom->type == IOMMU_DOMAIN_UNMANAGED)
+		if (domain->flags & PD_VFIO_MASK)
 			dev_data->use_vapic = 1;
 		else
 			dev_data->use_vapic = 0;