Message ID | 20230114073420.759989-1-baolu.lu@linux.intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp175570wrn; Fri, 13 Jan 2023 23:52:16 -0800 (PST) X-Google-Smtp-Source: AMrXdXtTvxwRCI6eCtzpM1bMgA0mLTrezBmIMZMa55DfI/hKlJaMnMiABceGzcGD2eKtllwKMOQF X-Received: by 2002:a17:903:2154:b0:193:2ff0:b665 with SMTP id s20-20020a170903215400b001932ff0b665mr11983066ple.15.1673682736334; Fri, 13 Jan 2023 23:52:16 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1673682736; cv=none; d=google.com; s=arc-20160816; b=EmkA0MJyDb1Z0gytuNv9a1sUMHORKvmSg9jwzFvGVXvtAjSrQB8qVTn791iwrTpoOI tv+pT64jZtX6pPzHkg8vjTfcyrpNNlQNFGpjq03ekJ+NsvvTrBEqMLYCddknXovGiQiy zaoAUY1DdsG6WdqKfhCUPepuihMyJ0z95IRlG1Qox9Ve2qqKoDY4qnXG4vNxroIG3T8r V4KTk1+06CL+vv0p+3WUzdpo3K1QO7C93vzMeIQd42oWxYp0Ox3NQXqgRjaJdWWJ+UGZ 3jdq0JL69qLtkXcu0Yu5CsJXUtnM+5PB30HCj5uX5DBcFVMhAqOla5dTZWIGWkgbuSUT ANzg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=gjciZsUlbuXUCKfszSljnsZ7vISM/QY0D4Rq21TWhR8=; b=jnDQYs1HbBiJ/2y2u1Sji2KRe9B35DSRz8jUB0auKa4sQyz7jNDNNbDX2rKeZ+cBuG fOyBApiu/AyHBUnxe4j2eTtHO3g4WryWcOLgYWvadPsJzCfnTOhd/7JcU+epUOqxY1tC fTpU8o3bJg5BrST6nrJWN77vdjU55KnronCf6+FaAX02pd++m2han0dCr3ZqKvEHWA5i PgoTgD7FlniIgGjqk6mP7wz0a8AGQNYI2P+Jmue0NTQ8P9pfjQpma5wkCMfY45UfpkK1 g1QzB/UFtcV2Nc04OdxohDIuD7sMMoHx1NDUc3EqNN0dtDFcUK0MFdzuHzU6iGnza6/2 TPRQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=ZcZ+5MYP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s13-20020a170902a50d00b001910b23667fsi18407236plq.213.2023.01.13.23.52.03; Fri, 13 Jan 2023 23:52:16 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=ZcZ+5MYP; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229709AbjANHm1 (ORCPT <rfc822;stefanalexe48@gmail.com> + 99 others); Sat, 14 Jan 2023 02:42:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58524 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229722AbjANHmZ (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sat, 14 Jan 2023 02:42:25 -0500 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7BDF3212B; Fri, 13 Jan 2023 23:42:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673682143; x=1705218143; h=from:to:cc:subject:date:message-id:mime-version: content-transfer-encoding; bh=BzXu35sjZYWmDFFcnSKsZ+w+vpfcyMxobL+k+iHSVtg=; b=ZcZ+5MYP4TTyqqZ7HfWS8LQh/EmMbWgvDfmsmW+lRFF0C0j4OV2UnhXb vvOv887+eSkms7s7SZt3O11xyUU/VXczUn5vs11YvtAYOf4uCtURnxi/9 HBLJDVa7lbqTHmd29G3r+NBXktxYvpOQmKPMvaKT3lbuMmp7YJ7VUyKtG WhtJ9vMwNtT1TqZdkvXQSWqfqfL6TIeQDp2mWJ46dOUlTrGugDANIMRo1 XcxtrTTgF5g0aE2pFwNCy8qlSQF218xqDwYcBOV4yBU5tQYeg1AbxCZtz yFx81RBkkJT9Va+dqIrVikCTyCuymexImnYvc7IeB5PVYkhksdYVlI+n5 w==; X-IronPort-AV: E=McAfee;i="6500,9779,10589"; a="326228279" X-IronPort-AV: E=Sophos;i="5.97,216,1669104000"; d="scan'208";a="326228279" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Jan 2023 23:42:22 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10589"; a="690738110" X-IronPort-AV: E=Sophos;i="5.97,216,1669104000"; d="scan'208";a="690738110" Received: from allen-box.sh.intel.com ([10.239.159.48]) by orsmga001.jf.intel.com with ESMTP; 13 Jan 2023 23:42:19 -0800 From: Lu Baolu <baolu.lu@linux.intel.com> To: Bjorn Helgaas <bhelgaas@google.com>, Joerg Roedel <jroedel@suse.de> Cc: Matt Fagnani <matt.fagnani@bell.net>, =?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>, Jason Gunthorpe <jgg@nvidia.com>, Kevin Tian <kevin.tian@intel.com>, Vasant Hegde <vasant.hegde@amd.com>, Tony Zhu <tony.zhu@intel.com>, linux-pci@vger.kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, Lu Baolu <baolu.lu@linux.intel.com> Subject: [PATCH v3 1/1] PCI: Add translated request only flag for pci_enable_pasid() Date: Sat, 14 Jan 2023 15:34:20 +0800 Message-Id: <20230114073420.759989-1-baolu.lu@linux.intel.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1754983548719426770?= X-GMAIL-MSGID: =?utf-8?q?1754983548719426770?= |
Series |
[v3,1/1] PCI: Add translated request only flag for pci_enable_pasid()
|
|
Commit Message
Baolu Lu
Jan. 14, 2023, 7:34 a.m. UTC
The PCIe fabric routes Memory Requests based on the TLP address, ignoring the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") requires some ACS features being supported on device's upstream path when enabling PCI/PASID. One alternative is ATS/PRI which lets the device resolve the PASID + addr pair before a memory request is made into a routeable TLB address through the translation agent. Those resolved addresses are then cached on the device instead of in the IOMMU TLB and the device always sets translated bit for PASID. One example of those devices are AMD graphic devices that always have ACS or ATS/PRI enabled together with PASID. This adds a flag parameter in the pci_enable_pasid() helper, with which the device driver could opt-in the fact that device always sets the translated bit for PASID. It also applies this opt-in for AMD graphic devices. Without this change, kernel boots to black screen on a system with below AMD graphic device: 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller]) DeviceName: ATI EG BROADWAY Subsystem: Hewlett-Packard Company Device 8332 At present, it is a common practice to enable/disable PCI PASID in the iommu drivers. Considering that the device driver knows more about the specific device, we will follow up by moving pci_enable_pasid() into the specific device drivers. Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865 Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/ Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> --- include/linux/pci-ats.h | 6 ++++-- drivers/iommu/amd/iommu.c | 2 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- drivers/iommu/intel/iommu.c | 3 ++- drivers/pci/ats.c | 8 ++++++-- 5 files changed, 14 insertions(+), 7 deletions(-) Change log: v3: - Replace ATS with ATS/PRI in commit message; - Refine a code comment; - Patch tested by Matt Fagnani. v2:https://lore.kernel.org/linux-iommu/20230113014409.752405-1-baolu.lu@linux.intel.com/ - Convert the bool to a named flag; - Convert TRANSLED to XLATED. v1: - https://lore.kernel.org/linux-iommu/20230112084629.737653-1-baolu.lu@linux.intel.com/
Comments
On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: > The PCIe fabric routes Memory Requests based on the TLP address, ignoring > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: > Enable PASID only when ACS RR & UF enabled on upstream path") requires > some ACS features being supported on device's upstream path when enabling > PCI/PASID. > > One alternative is ATS/PRI which lets the device resolve the PASID + addr > pair before a memory request is made into a routeable TLB address through > the translation agent. Those resolved addresses are then cached on the > device instead of in the IOMMU TLB and the device always sets translated > bit for PASID. One example of those devices are AMD graphic devices that > always have ACS or ATS/PRI enabled together with PASID. > > This adds a flag parameter in the pci_enable_pasid() helper, with which > the device driver could opt-in the fact that device always sets the > translated bit for PASID. > > It also applies this opt-in for AMD graphic devices. Without this change, > kernel boots to black screen on a system with below AMD graphic device: > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. > [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) > (prog-if 00 [VGA controller]) > DeviceName: ATI EG BROADWAY > Subsystem: Hewlett-Packard Company Device 8332 > > At present, it is a common practice to enable/disable PCI PASID in the > iommu drivers. Considering that the device driver knows more about the > specific device, we will follow up by moving pci_enable_pasid() into > the specific device drivers. > > Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") > Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865 > Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/ > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Suggested-by: Christian König <christian.koenig@amd.com> > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> > --- > include/linux/pci-ats.h | 6 ++++-- > drivers/iommu/amd/iommu.c | 2 +- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- > drivers/iommu/intel/iommu.c | 3 ++- > drivers/pci/ats.c | 8 ++++++-- > 5 files changed, 14 insertions(+), 7 deletions(-) Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
Linux regression tracking (Thorsten Leemhuis)
Jan. 27, 2023, 11:30 a.m. UTC |
#2
Addressed
Unaddressed
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone. What happened to below patch? It looks like there was no progress since ten days now. Or did I miss something? Reminder, the patch is fixing a regression, hence it would be good if this could be fixed rather sooner than later. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. On 16.01.23 16:42, Jason Gunthorpe wrote: > On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: >> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >> Enable PASID only when ACS RR & UF enabled on upstream path") requires >> some ACS features being supported on device's upstream path when enabling >> PCI/PASID. >> >> One alternative is ATS/PRI which lets the device resolve the PASID + addr >> pair before a memory request is made into a routeable TLB address through >> the translation agent. Those resolved addresses are then cached on the >> device instead of in the IOMMU TLB and the device always sets translated >> bit for PASID. One example of those devices are AMD graphic devices that >> always have ACS or ATS/PRI enabled together with PASID. >> >> This adds a flag parameter in the pci_enable_pasid() helper, with which >> the device driver could opt-in the fact that device always sets the >> translated bit for PASID. >> >> It also applies this opt-in for AMD graphic devices. Without this change, >> kernel boots to black screen on a system with below AMD graphic device: >> >> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. >> [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) >> (prog-if 00 [VGA controller]) >> DeviceName: ATI EG BROADWAY >> Subsystem: Hewlett-Packard Company Device 8332 >> >> At present, it is a common practice to enable/disable PCI PASID in the >> iommu drivers. Considering that the device driver knows more about the >> specific device, we will follow up by moving pci_enable_pasid() into >> the specific device drivers. >> >> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") >> Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865 >> Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/ >> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> >> Suggested-by: Christian König <christian.koenig@amd.com> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com> >> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> >> --- >> include/linux/pci-ats.h | 6 ++++-- >> drivers/iommu/amd/iommu.c | 2 +- >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- >> drivers/iommu/intel/iommu.c | 3 ++- >> drivers/pci/ats.c | 8 ++++++-- >> 5 files changed, 14 insertions(+), 7 deletions(-) > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > > Jason
On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: > The PCIe fabric routes Memory Requests based on the TLP address, ignoring > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: > Enable PASID only when ACS RR & UF enabled on upstream path") requires > some ACS features being supported on device's upstream path when enabling > PCI/PASID. > > One alternative is ATS/PRI which lets the device resolve the PASID + addr > pair before a memory request is made into a routeable TLB address through > the translation agent. This sounds like "ATS/PRI" is a solution to a problem, but we haven't stated the problem yet. > Those resolved addresses are then cached on the > device instead of in the IOMMU TLB and the device always sets translated > bit for PASID. One example of those devices are AMD graphic devices that > always have ACS or ATS/PRI enabled together with PASID. > > This adds a flag parameter in the pci_enable_pasid() helper, with which > the device driver could opt-in the fact that device always sets the > translated bit for PASID. Nit: "Add a flag ..." and "Apply this opt-in ..." (below). > It also applies this opt-in for AMD graphic devices. Without this change, > kernel boots to black screen on a system with below AMD graphic device: > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. > [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) > (prog-if 00 [VGA controller]) > DeviceName: ATI EG BROADWAY > Subsystem: Hewlett-Packard Company Device 8332 What is the underlying failure here? "Black screen" is useful but we should say *why* that happens, e.g., transactions went the wrong place or whatever. > At present, it is a common practice to enable/disable PCI PASID in the > iommu drivers. Considering that the device driver knows more about the > specific device, we will follow up by moving pci_enable_pasid() into > the specific device drivers. > @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev) > * pci_enable_pasid - Enable the PASID capability > * @pdev: PCI device structure > * @features: Features to enable > + * @flags: device-specific flags > + * - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type > + * for all PASID memory requests. s/use/uses/ I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows, right? We can't deduce from architected config space that the device will produce PASID prefixes for every Memory Request, can we? Bjorn
> From: Bjorn Helgaas <helgaas@kernel.org> > Sent: Saturday, January 28, 2023 1:31 AM > > On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: > > > @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev) > > * pci_enable_pasid - Enable the PASID capability > > * @pdev: PCI device structure > > * @features: Features to enable > > + * @flags: device-specific flags > > + * - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated > type > > + * for all PASID memory requests. > > s/use/uses/ > > I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows, > right? We can't deduce from architected config space that the device Yes > will produce PASID prefixes for every Memory Request, can we? > No, we can't. PASID cap only indicates that the device is capable of using PASID prefix, not a mandatory requirement on every memory request. Similar case is PRI. Having PRI enabled only means the device is capable of handling I/O page faults, not the indicator that it can 100% tolerate fault in every memory access. That is the main reason why vfio/iommufd can't simply skip pinning guest memory when seeing PRI enabled (well, though PRI is not supported yet). It has to be opted-in via a driver specific way e.g. a vfio variant driver. Thanks Kevin
Hi Bjorn, Thanks for your review comments. On 2023/1/28 1:30, Bjorn Helgaas wrote: > On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: >> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >> Enable PASID only when ACS RR & UF enabled on upstream path") requires >> some ACS features being supported on device's upstream path when enabling >> PCI/PASID. >> >> One alternative is ATS/PRI which lets the device resolve the PASID + addr >> pair before a memory request is made into a routeable TLB address through >> the translation agent. > > This sounds like "ATS/PRI" is a solution to a problem, but we haven't > stated the problem yet. > >> Those resolved addresses are then cached on the >> device instead of in the IOMMU TLB and the device always sets translated >> bit for PASID. One example of those devices are AMD graphic devices that >> always have ACS or ATS/PRI enabled together with PASID. >> >> This adds a flag parameter in the pci_enable_pasid() helper, with which >> the device driver could opt-in the fact that device always sets the >> translated bit for PASID. > > Nit: "Add a flag ..." and "Apply this opt-in ..." (below). > >> It also applies this opt-in for AMD graphic devices. Without this change, >> kernel boots to black screen on a system with below AMD graphic device: >> >> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. >> [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) >> (prog-if 00 [VGA controller]) >> DeviceName: ATI EG BROADWAY >> Subsystem: Hewlett-Packard Company Device 8332 > > What is the underlying failure here? "Black screen" is useful but we > should say *why* that happens, e.g., transactions went the wrong place > or whatever. All above make sense to me. I post my new commit message at the end of this reply. > >> At present, it is a common practice to enable/disable PCI PASID in the >> iommu drivers. Considering that the device driver knows more about the >> specific device, we will follow up by moving pci_enable_pasid() into >> the specific device drivers. > >> @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev) >> * pci_enable_pasid - Enable the PASID capability >> * @pdev: PCI device structure >> * @features: Features to enable >> + * @flags: device-specific flags >> + * - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type >> + * for all PASID memory requests. > > s/use/uses/ Yes. > > I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows, > right? We can't deduce from architected config space that the device > will produce PASID prefixes for every Memory Request, can we? No, we can't. That's the reason why we need a flag here. [ Below is an updated commit message. Hope it can describe things clearly.] PCI: Add translated request only flag for pci_enable_pasid() The PCIe fabric routes Memory Requests based on the TLP address, ignoring the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") requires some ACS features being supported on device's upstream path when enabling PCI/PASID. However, above change causes the Linux kernel boots to black screen on a system with below graphic device: 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller]) DeviceName: ATI EG BROADWAY Subsystem: Hewlett-Packard Company Device 8332 The kernel trace looks like below: Call Trace: <TASK> amd_iommu_attach_device+0x2e0/0x300 __iommu_attach_device+0x1b/0x90 iommu_attach_group+0x65/0xa0 amd_iommu_init_device+0x16b/0x250 [iommu_v2] kfd_iommu_resume+0x4c/0x1a0 [amdgpu] kgd2kfd_resume_iommu+0x12/0x30 [amdgpu] kgd2kfd_device_init.cold+0x346/0x49a [amdgpu] amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu] amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu] ? _raw_spin_lock_irqsave+0x23/0x50 amdgpu_driver_load_kms+0x15/0x110 [amdgpu] amdgpu_pci_probe+0x161/0x370 [amdgpu] local_pci_probe+0x41/0x80 pci_device_probe+0xb3/0x220 really_probe+0xde/0x380 ? pm_runtime_barrier+0x50/0x90 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xce/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x73/0xa0 bus_add_driver+0x1ae/0x200 driver_register+0x89/0xe0 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x59/0x230 do_init_module+0x4a/0x200 __do_sys_init_module+0x157/0x180 do_syscall_64+0x5b/0x80 ? handle_mm_fault+0xff/0x2f0 ? do_user_addr_fault+0x1ef/0x690 ? exc_page_fault+0x70/0x170 entry_SYSCALL_64_after_hwframe+0x72/0xdc The AMD iommu driver allocates a new domain (called v2 domain) for the amdgpu device and enables its PCI PASID/ATS/PRI before attaching the v2 domain to it. The failure of pci_enable_pasid() due to lack of ACS causes the domain attaching device to fail. The amdgpu device is unable to DMA normally, resulting in a black screen of the system. However, this device is special as it relies on ATS/PRI to resolve the PASID + addr pair before a memory request is made into a routeable TLB address through the translation agent. Those resolved addresses are then cached on the device instead of in the IOMMU TLB and the device always uses translated memory request for PASID. ACS is not necessary for the devices that always use translated memory request for PASID. But this is device specific and only device driver knows this. We can't deduce this from architected config space. Add a flag for pci_enable_pasid(), with which the device drivers could opt-in the fact that device always uses translated memory requests for PASID hence the ACS is not a necessity. Apply this opt-in for above AMD graphic device. At present, it is a common practice to enable/disable PCI PASID in the iommu drivers. Considering that the device driver knows more about the specific device, it's better to move pci_enable_pasid() into the specific device drivers. [-- end --] -- Best regards, baolu
On Sun, Jan 29, 2023 at 04:42:32PM +0800, Baolu Lu wrote: > On 2023/1/28 1:30, Bjorn Helgaas wrote: > > On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote: > > > The PCIe fabric routes Memory Requests based on the TLP address, ignoring > > > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: > > > Enable PASID only when ACS RR & UF enabled on upstream path") requires > > > some ACS features being supported on device's upstream path when enabling > > > PCI/PASID. > > > > > > One alternative is ATS/PRI which lets the device resolve the PASID + addr > > > pair before a memory request is made into a routeable TLB address through > > > the translation agent. > > > > This sounds like "ATS/PRI" is a solution to a problem, but we haven't > > stated the problem yet. > > > > > Those resolved addresses are then cached on the > > > device instead of in the IOMMU TLB and the device always sets translated > > > bit for PASID. One example of those devices are AMD graphic devices that > > > always have ACS or ATS/PRI enabled together with PASID. > > > > > > This adds a flag parameter in the pci_enable_pasid() helper, with which > > > the device driver could opt-in the fact that device always sets the > > > translated bit for PASID. > > > > > > It also applies this opt-in for AMD graphic devices. Without this change, > > > kernel boots to black screen on a system with below AMD graphic device: > > > > > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. > > > [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) > > > (prog-if 00 [VGA controller]) > > > DeviceName: ATI EG BROADWAY > > > Subsystem: Hewlett-Packard Company Device 8332 > > > > What is the underlying failure here? "Black screen" is useful but we > > should say *why* that happens, e.g., transactions went the wrong place > > or whatever. > > I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows, > > right? We can't deduce from architected config space that the device > > will produce PASID prefixes for every Memory Request, can we? > > No, we can't. That's the reason why we need a flag here. Sorry, I'm still confused. PCI_PASID_XLATED_REQ_ONLY is a device-specific property, and you want to opt-in AMD graphics devices. Where's the AMD graphics-specific change? The current patch does this: pdev_pri_ats_enable pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY) which looks like it does it for *all* devices below an AMD IOMMU, without any device or driver input. PCIe r6.0, sec 6.20.1: A Function is not permitted to generate Requests using Translated Addresses and a PASID unless both PASID Enable and Translated Requests with PASID Enable are Set. You want AMD graphics devices to do DMA with translated addresses and PASID, right? pci_enable_pasid() sets PASID Enable (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests with PASID Enable" is set. We don't even have a #define for it. I would think we should check "Translated Requests with PASID Supported" before setting "Translated Requests with PASID Enable", too? > PCI: Add translated request only flag for pci_enable_pasid() > > The PCIe fabric routes Memory Requests based on the TLP address, ignoring > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: > Enable PASID only when ACS RR & UF enabled on upstream path") requires > some ACS features being supported on device's upstream path when enabling > PCI/PASID. > > However, above change causes the Linux kernel boots to black screen on a > system with below graphic device: We need a PCIe concept-level description of the issue first, i.e., in terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue as an instance. > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. > [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) > (prog-if 00 [VGA controller]) > DeviceName: ATI EG BROADWAY > Subsystem: Hewlett-Packard Company Device 8332 > > The kernel trace looks like below: > > Call Trace: > <TASK> > amd_iommu_attach_device+0x2e0/0x300 > __iommu_attach_device+0x1b/0x90 > iommu_attach_group+0x65/0xa0 > amd_iommu_init_device+0x16b/0x250 [iommu_v2] > kfd_iommu_resume+0x4c/0x1a0 [amdgpu] > kgd2kfd_resume_iommu+0x12/0x30 [amdgpu] > kgd2kfd_device_init.cold+0x346/0x49a [amdgpu] > amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu] > amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu] > ? _raw_spin_lock_irqsave+0x23/0x50 > amdgpu_driver_load_kms+0x15/0x110 [amdgpu] > amdgpu_pci_probe+0x161/0x370 [amdgpu] > local_pci_probe+0x41/0x80 > pci_device_probe+0xb3/0x220 > really_probe+0xde/0x380 > ? pm_runtime_barrier+0x50/0x90 > __driver_probe_device+0x78/0x170 > driver_probe_device+0x1f/0x90 > __driver_attach+0xce/0x1c0 > ? __pfx___driver_attach+0x10/0x10 > bus_for_each_dev+0x73/0xa0 > bus_add_driver+0x1ae/0x200 > driver_register+0x89/0xe0 > ? __pfx_init_module+0x10/0x10 [amdgpu] > do_one_initcall+0x59/0x230 > do_init_module+0x4a/0x200 > __do_sys_init_module+0x157/0x180 > do_syscall_64+0x5b/0x80 > ? handle_mm_fault+0xff/0x2f0 > ? do_user_addr_fault+0x1ef/0x690 > ? exc_page_fault+0x70/0x170 > entry_SYSCALL_64_after_hwframe+0x72/0xdc The stack trace doesn't seem like it shows a failure, so I'm not sure it's useful this time. If it is, we can at least strip out the irrelevant pieces. > The AMD iommu driver allocates a new domain (called v2 domain) for the "v2 domain" needs to be something greppable -- an identifier, filename, etc. > amdgpu device and enables its PCI PASID/ATS/PRI before attaching the > v2 domain to it. The failure of pci_enable_pasid() due to lack of ACS > causes the domain attaching device to fail. The amdgpu device is unable > to DMA normally, resulting in a black screen of the system. > > However, this device is special as it relies on ATS/PRI to resolve the > PASID + addr pair before a memory request is made into a routeable TLB > address through the translation agent. Those resolved addresses are then > cached on the device instead of in the IOMMU TLB and the device always > uses translated memory request for PASID. > > ACS is not necessary for the devices that always use translated memory > request for PASID. But this is device specific and only device driver > knows this. We can't deduce this from architected config space. > > Add a flag for pci_enable_pasid(), with which the device drivers could > opt-in the fact that device always uses translated memory requests for > PASID hence the ACS is not a necessity. Apply this opt-in for above AMD > graphic device. > > At present, it is a common practice to enable/disable PCI PASID in the > iommu drivers. Considering that the device driver knows more about the > specific device, it's better to move pci_enable_pasid() into the specific > device drivers. > [-- end --] > > -- > Best regards, > baolu
On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote: > Sorry, I'm still confused. PCI_PASID_XLATED_REQ_ONLY is a > device-specific property, and you want to opt-in AMD graphics devices. > Where's the AMD graphics-specific change? The current patch does > this: > > pdev_pri_ats_enable > pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY) > > which looks like it does it for *all* devices below an AMD IOMMU, > without any device or driver input. AMD GPU has a private interface to AMD IOMMU to support PASID support that only it uses. When AMD's IOMMU driver is cleaned up to use the common PASID API this will have to be moved into the GPU driver. But we are not there yet.. Jason
On 2023/1/31 2:38, Bjorn Helgaas wrote: > PCIe r6.0, sec 6.20.1: > > A Function is not permitted to generate Requests using Translated > Addresses and a PASID unless both PASID Enable and Translated > Requests with PASID Enable are Set. > > You want AMD graphics devices to do DMA with translated addresses and > PASID, right? pci_enable_pasid() sets PASID Enable > (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests > with PASID Enable" is set. We don't even have a #define for it. > > I would think we should check "Translated Requests with PASID > Supported" before setting "Translated Requests with PASID Enable", > too? This seems to be an ECN for PCIe 5.x: https://members.pcisig.com/wg/PCI-SIG/document/14929 What I read from this ECN is that, With this ECN, translated memory requests for PASIDs are not allowed to carry a PASID prefix if "Translated Requests with PASID Enabled" is not set. It does not mean whether the device can generate translated memory requests for PASID, but whether the memory request can carry a PASID prefix. Best regards, baolu
On 2023/1/31 2:38, Bjorn Helgaas wrote: >> PCI: Add translated request only flag for pci_enable_pasid() >> >> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >> Enable PASID only when ACS RR & UF enabled on upstream path") requires >> some ACS features being supported on device's upstream path when enabling >> PCI/PASID. >> >> However, above change causes the Linux kernel boots to black screen on a >> system with below graphic device: > We need a PCIe concept-level description of the issue first, i.e., in > terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue > as an instance. How about below description? PCIe endpoints can use ATS to request DMA remapping hardware to translate an IOVA to its mapped physical address. If the translation is missing or the permissions are insufficient, the PRI is used to trigger an I/O page fault. The IOMMU driver will fill the mapping with desired permissions and return the translated address to the device. The translated address is specified by the IOMMU driver. The IOMMU driver ensures that the address is a DMA buffer address instead of any P2P address in the PCI fabric. Therefore, any translated memory request will eventually be routed to IOMMU regardless of whether there is ACS control in the up-streaming path. AMD GPU is one of those devices. Furthermore, it always uses translated memory requests for PASID. > >> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. >> [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) >> (prog-if 00 [VGA controller]) >> DeviceName: ATI EG BROADWAY >> Subsystem: Hewlett-Packard Company Device 8332 >> >> The kernel trace looks like below: >> >> Call Trace: >> <TASK> >> amd_iommu_attach_device+0x2e0/0x300 >> __iommu_attach_device+0x1b/0x90 >> iommu_attach_group+0x65/0xa0 >> amd_iommu_init_device+0x16b/0x250 [iommu_v2] >> kfd_iommu_resume+0x4c/0x1a0 [amdgpu] >> kgd2kfd_resume_iommu+0x12/0x30 [amdgpu] >> kgd2kfd_device_init.cold+0x346/0x49a [amdgpu] >> amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu] >> amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu] >> ? _raw_spin_lock_irqsave+0x23/0x50 >> amdgpu_driver_load_kms+0x15/0x110 [amdgpu] >> amdgpu_pci_probe+0x161/0x370 [amdgpu] >> local_pci_probe+0x41/0x80 >> pci_device_probe+0xb3/0x220 >> really_probe+0xde/0x380 >> ? pm_runtime_barrier+0x50/0x90 >> __driver_probe_device+0x78/0x170 >> driver_probe_device+0x1f/0x90 >> __driver_attach+0xce/0x1c0 >> ? __pfx___driver_attach+0x10/0x10 >> bus_for_each_dev+0x73/0xa0 >> bus_add_driver+0x1ae/0x200 >> driver_register+0x89/0xe0 >> ? __pfx_init_module+0x10/0x10 [amdgpu] >> do_one_initcall+0x59/0x230 >> do_init_module+0x4a/0x200 >> __do_sys_init_module+0x157/0x180 >> do_syscall_64+0x5b/0x80 >> ? handle_mm_fault+0xff/0x2f0 >> ? do_user_addr_fault+0x1ef/0x690 >> ? exc_page_fault+0x70/0x170 >> entry_SYSCALL_64_after_hwframe+0x72/0xdc > The stack trace doesn't seem like it shows a failure, so I'm not sure > it's useful this time. If it is, we can at least strip out the > irrelevant pieces. I will drop above from the commit message. > >> The AMD iommu driver allocates a new domain (called v2 domain) for the > "v2 domain" needs to be something greppable -- an identifier, > filename, etc. > The code reads, 2052 if (iommu_feature(iommu, FEATURE_GT) && 2053 iommu_feature(iommu, FEATURE_PPR)) { 2054 iommu->is_iommu_v2 = true; So, how about ..The AMD GPU has a private interface to its own AMD IOMMU, which could be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu driver allocates a special domain for the GPU device .. ? Best regards, baolu
On Mon, Jan 30, 2023 at 02:47:32PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote: > > > Sorry, I'm still confused. PCI_PASID_XLATED_REQ_ONLY is a > > device-specific property, and you want to opt-in AMD graphics devices. > > Where's the AMD graphics-specific change? The current patch does > > this: > > > > pdev_pri_ats_enable > > pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY) > > > > which looks like it does it for *all* devices below an AMD IOMMU, > > without any device or driver input. > > AMD GPU has a private interface to AMD IOMMU to support PASID support > that only it uses. What is it that makes this a private interface? It seems like we will end up enabling PASID on any device below an AMD v2 IOMMU. I do see the DRM amdgpu_amdkfd_device_init() path that ends up at domain_enable_v2(). Maybe that's it? But amd_iommu_domain_alloc() also leads to domain_enable_v2(), and that's pretty generic, so it looks like we set PD_IOMMUV2_MASK whenever the IOMMU supports it. And it seems like we call pci_enable_pasid() for all endpoints below a v2 IOMMU. Of course, if the endpoint doesn't have a PASID cap, pci_enable_pasid() fails (and pdev_pri_ats_enable() probably splats useless warnings when we try to disable PRI and PASID which haven't been enabled). I guess I'm trying to convince myself that no harm in enabling PASID for any device below an AMD v2 IOMMU. But I don't think a device is *required* to use translated addresses with PASID, and if it uses untranslated addresses with PASID, don't we need ACS to avoid misrouting? Bjorn
On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote: > On 2023/1/31 2:38, Bjorn Helgaas wrote: > > > PCI: Add translated request only flag for pci_enable_pasid() > > > > > > The PCIe fabric routes Memory Requests based on the TLP address, ignoring > > > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: > > > Enable PASID only when ACS RR & UF enabled on upstream path") requires > > > some ACS features being supported on device's upstream path when enabling > > > PCI/PASID. Looking up 201007ef707a to see what ensuring system integrity means, it prevents Memory Requests with PASID, which should always be routed to the RC, from being mistakenly routed as peer-to-peer requests. > > > However, above change causes the Linux kernel boots to black screen on a > > > system with below graphic device: > > > > We need a PCIe concept-level description of the issue first, i.e., in > > terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue > > as an instance. > > How about below description? Thanks, this is exactly the sort of thing I'm looking for. But my understanding of ATS/PRI/PASID is weak, so I'm still working through this. Tell me when I say something wrong below... > PCIe endpoints can use ATS to request DMA remapping hardware to > translate an IOVA to its mapped physical address. If the translation is > missing or the permissions are insufficient, the PRI is used to trigger > an I/O page fault. The IOMMU driver will fill the mapping with desired > permissions and return the translated address to the device. In PCIe spec language, I think you're saying that a PCIe Function may contain an ATC. If the ATC Capability Enable bit is set, the Function can issue Translation Requests. The TA (aka IOMMU) will respond with a Translation Completion. If the Completion is a CplD, it contains the translated address and the Function can store the entry in its ATC. I assume the I/O page fault case corresponds to a Cpl (with no data) meaning that the TA could not translate the address. If the TA doesn't have a mapping with the desired permissions, and the Function's Page Request Capability Enable bit is set, it may issue a Page Request Message. It's up to the TA/IOMMU to make this message visible to the OS, which can make the page resident, create an IOMMU mapping, and enable a PRG Response Message. After the Function receives the PRG Response Message, it would issue another Translation Request. > The translated address is specified by the IOMMU driver. The IOMMU > driver ensures that the address is a DMA buffer address instead of any > P2P address in the PCI fabric. Therefore, any translated memory request > will eventually be routed to IOMMU regardless of whether there is ACS > control in the up-streaming path. A Memory Request with an address that is not a P2P address, i.e., it is not contained in any bridge aperture, will *always* be routed toward the RC, won't it? Isn't that the case regardless of whether the address is translated or untranslated, and even regardless of ACS? IIUC, ACS basically causes peer-to-peer requests to be routed upstream instead of directly to the peer. OK, reading this again, I realize that I just restated exactly what you had already written, sorry about that. > AMD GPU is one of those devices. I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? And furthermore, that the GPU *always* uses Translated addresses with PASID? So I guess what's going on here is that if: - A device only uses PASID with Translated addresses, and - those Translated addresses are never P2P addresses, then - those transactions will always be routed to the RC. And this applies even if there is no ACS or ACS doesn't support PCI_ACS_RR and PCI_ACS_UF. The black screen happens because ... ? What can we include in the commit log to help people find this fix? I see these in the bugzilla: WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 (These look like defects in pdev_pri_ats_enable(), so really just distractions) kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874 kfd kfd: amdgpu: device 1002:9874 NOT added due to errors BUG: kernel NULL pointer dereference, address: 0000000000000058 RIP: 0010:report_iommu_fault+0x11/0x90 I couldn't figure out the NULL pointer dereference. I expected it to be from a BUG() or similar in report_iommu_fault(), but I don't see that. > Furthermore, it always uses translated memory requests for PASID. > > > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. > > > [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) > > > (prog-if 00 [VGA controller]) > > > DeviceName: ATI EG BROADWAY > > > Subsystem: Hewlett-Packard Company Device 8332 > > > The AMD iommu driver allocates a new domain (called v2 domain) for the > > "v2 domain" needs to be something greppable -- an identifier, > > filename, etc. > > The code reads, > > 2052 if (iommu_feature(iommu, FEATURE_GT) && > 2053 iommu_feature(iommu, FEATURE_PPR)) { > 2054 iommu->is_iommu_v2 = true; > > So, how about > > ..The AMD GPU has a private interface to its own AMD IOMMU, which could > be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu > driver allocates a special domain for the GPU device .. Where is this special domain allocated? I think the above tests for *IOMMU* features (I assume "GTSup: Guest translations supported" and "PPRSup: Peripheral page request support" based on the AMD IOMMU spec). It doesn't test that this is a GPU. This change doesn't feel safe for all possible devices that have a PASID Capability because we don't know whether they *always* use Translated addresses with PASID TLPs. Bjorn
On Tue, Jan 31, 2023 at 05:50:52PM -0600, Bjorn Helgaas wrote: > On Mon, Jan 30, 2023 at 02:47:32PM -0400, Jason Gunthorpe wrote: > > On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote: > > > > > Sorry, I'm still confused. PCI_PASID_XLATED_REQ_ONLY is a > > > device-specific property, and you want to opt-in AMD graphics devices. > > > Where's the AMD graphics-specific change? The current patch does > > > this: > > > > > > pdev_pri_ats_enable > > > pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY) > > > > > > which looks like it does it for *all* devices below an AMD IOMMU, > > > without any device or driver input. > > > > AMD GPU has a private interface to AMD IOMMU to support PASID support > > that only it uses. > > What is it that makes this a private interface? The symbol names start with "amd" drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_snp_en); drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_v2_supported); drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_get_max_banks); drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_supported); drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_get_max_counters); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_register_ga_log_notifier); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL_GPL(amd_iommu_is_attach_deferred); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_register_ppr_notifier); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_direct_map); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_enable_v2); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_flush_page); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_flush_tlb); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_set_gcr3); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_clear_gcr3); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_complete_ppr); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_device_info); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_activate_guest_mode); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_deactivate_guest_mode); drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_update_ga); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_bind_pasid); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_unbind_pasid); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_init_device); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_free_device); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_set_invalid_ppr_cb); drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_set_invalidate_ctx_cb); A driver should not be using EXPORT_SYMBOL at all, this is all superseded by the core code that has now been created, but this has not been cleaned up. So the troublesome PASID bit is here: drivers/gpu/drm/amd/amdkfd/kfd_iommu.c: err = amd_iommu_bind_pasid(dev->adev->pdev, p->pasid, p->lead_thread); drivers/gpu/drm/amd/amdkfd/kfd_iommu.c: err = amd_iommu_bind_pasid(kfd->adev->pdev, p->pasid, And the logic AMD iommu uses to call pci_enable_pasid() is in the wrong place, it should be in drm/amd someplace not in the iommu drivers. This is all more stuff to fix > But amd_iommu_domain_alloc() also leads to domain_enable_v2(), and > that's pretty generic, so it looks like we set PD_IOMMUV2_MASK > whenever the IOMMU supports it. Yes, it is all sort of messy still. AMD and ARM have a requirement that the RID page table format be in a certain way to be able to enable the PASID decoded in the iommu So the iommu drivers are trying to guess what page table format to use based on the PCI caps, and wrongly turning on PASID mode at the same time. > I guess I'm trying to convince myself that no harm in enabling PASID > for any device below an AMD v2 IOMMU. But I don't think a device is > *required* to use translated addresses with PASID, and if it uses > untranslated addresses with PASID, don't we need ACS to avoid > misrouting? PASID enabling via config space doesn't actually do much - it is attaching a PASID at the iommu and attempting to operate the device with a PASID that is the key item. So right now, the only thing in the kernel which can do that is amdkfd because of the private interface. AMD says amdkfd HW always issues ATS with a PASID and never a MemRd/Wr, which is why it works at all. Jason
On Tue, Jan 31, 2023 at 06:14:19PM -0600, Bjorn Helgaas wrote: > > AMD GPU is one of those devices. > > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? > And furthermore, that the GPU *always* uses Translated addresses with > PASID? I'm not versed in the spec lingo, but the GPU issues MemRd/Wrs with the translated bit set and no PASID header - which is the correct form for an address that was translated by ATS. To get to that it issues ATS requests, and only the ATS related requests will carry the PASID. ATS related requests always route to the root port, which is why it is functionally equivalent to ACS RR/UF in these cases. Translated requests always route where they are supposed to go, even with P2P and things. > And this applies even if there is no ACS or ACS doesn't support > PCI_ACS_RR and PCI_ACS_UF. > > The black screen happens because ... ? AMD GPU driver bugs blow up if it cannot setup PASID. > I couldn't figure out the NULL pointer dereference. I expected it to > be from a BUG() or similar in report_iommu_fault(), but I don't see > that. IIRC it is a buggy error unwind handling in the AMD GPU driver. Jason
Bjorn, On 2/1/2023 5:44 AM, Bjorn Helgaas wrote: > On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote: >> On 2023/1/31 2:38, Bjorn Helgaas wrote: >>>> PCI: Add translated request only flag for pci_enable_pasid() >>>> >>>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >>>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >>>> Enable PASID only when ACS RR & UF enabled on upstream path") requires >>>> some ACS features being supported on device's upstream path when enabling >>>> PCI/PASID. > > Looking up 201007ef707a to see what ensuring system integrity means, > it prevents Memory Requests with PASID, which should always be routed > to the RC, from being mistakenly routed as peer-to-peer requests. > >>>> However, above change causes the Linux kernel boots to black screen on a >>>> system with below graphic device: >>> >>> We need a PCIe concept-level description of the issue first, i.e., in >>> terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue >>> as an instance. >> >> How about below description? > > Thanks, this is exactly the sort of thing I'm looking for. But my > understanding of ATS/PRI/PASID is weak, so I'm still working through > this. Tell me when I say something wrong below... > >> PCIe endpoints can use ATS to request DMA remapping hardware to >> translate an IOVA to its mapped physical address. If the translation is >> missing or the permissions are insufficient, the PRI is used to trigger >> an I/O page fault. The IOMMU driver will fill the mapping with desired >> permissions and return the translated address to the device. > > In PCIe spec language, I think you're saying that a PCIe Function may > contain an ATC. If the ATC Capability Enable bit is set, the Function > can issue Translation Requests. > > The TA (aka IOMMU) will respond with a Translation Completion. If the > Completion is a CplD, it contains the translated address and the > Function can store the entry in its ATC. I assume the I/O page fault > case corresponds to a Cpl (with no data) meaning that the TA could not > translate the address. > > If the TA doesn't have a mapping with the desired permissions, and the > Function's Page Request Capability Enable bit is set, it may issue a > Page Request Message. It's up to the TA/IOMMU to make this message > visible to the OS, which can make the page resident, create an IOMMU > mapping, and enable a PRG Response Message. After the Function > receives the PRG Response Message, it would issue another Translation > Request. > >> The translated address is specified by the IOMMU driver. The IOMMU >> driver ensures that the address is a DMA buffer address instead of any >> P2P address in the PCI fabric. Therefore, any translated memory request >> will eventually be routed to IOMMU regardless of whether there is ACS >> control in the up-streaming path. > > A Memory Request with an address that is not a P2P address, i.e., it > is not contained in any bridge aperture, will *always* be routed > toward the RC, won't it? Isn't that the case regardless of whether > the address is translated or untranslated, and even regardless of ACS? > > IIUC, ACS basically causes peer-to-peer requests to be routed upstream > instead of directly to the peer. > > OK, reading this again, I realize that I just restated exactly what > you had already written, sorry about that. > >> AMD GPU is one of those devices. > > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? > And furthermore, that the GPU *always* uses Translated addresses with > PASID? > > So I guess what's going on here is that if: > > - A device only uses PASID with Translated addresses, and > - those Translated addresses are never P2P addresses, then > - those transactions will always be routed to the RC. > > And this applies even if there is no ACS or ACS doesn't support > PCI_ACS_RR and PCI_ACS_UF. > > The black screen happens because ... ? > > What can we include in the commit log to help people find this fix? I > see these in the bugzilla: > > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 > > (These look like defects in pdev_pri_ats_enable(), so really just > distractions) Right. We have fixed error handling path in this function. Joerg has queued the fix. > > kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874 > kfd kfd: amdgpu: device 1002:9874 NOT added due to errors > BUG: kernel NULL pointer dereference, address: 0000000000000058 > RIP: 0010:report_iommu_fault+0x11/0x90 > > I couldn't figure out the NULL pointer dereference. I expected it to > be from a BUG() or similar in report_iommu_fault(), but I don't see > that. Its coming from below path : - During system boot IOMMU allocates default domain - AMD IOMMU v2 module (iommu_v2) created another domain and tried to attach devices to new domain. - In device attachment path (amd_iommu_attach_device()) it first detaches device from current domain and tries to attach device to new domain. Here device attachment failed as PASID enable check failed. - We didn't recover from above failure (I have proposed fix for this [1]). - So device to domain attachment is not in consistent state. - Device tried to do DMA and hit IO fault. Above NULL pointer derefence is coming from that path as dev to domain setup is not proper. [1] https://lore.kernel.org/linux-iommu/20230113135956.5788-1-vasant.hegde@amd.com/T/#t -Vasant
On 2023/2/1 8:14, Bjorn Helgaas wrote: >> The translated address is specified by the IOMMU driver. The IOMMU >> driver ensures that the address is a DMA buffer address instead of any >> P2P address in the PCI fabric. Therefore, any translated memory request >> will eventually be routed to IOMMU regardless of whether there is ACS >> control in the up-streaming path. > A Memory Request with an address that is not a P2P address, i.e., it > is not contained in any bridge aperture, will*always* be routed > toward the RC, won't it? Yes. > Isn't that the case regardless of whether > the address is translated or untranslated, and even regardless of ACS? They are different. The translated addresses are approved by the Linux kernel. But untranslated addresses are not. Malicious or buggy userspace applications could program the device to DMA to addresses locating in the P2P aperture. > IIUC, ACS basically causes peer-to-peer requests to be routed upstream > instead of directly to the peer. Yes. Best regards, baolu
On 2023/2/1 8:14, Bjorn Helgaas wrote: > On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote: >> On 2023/1/31 2:38, Bjorn Helgaas wrote: >>>> PCI: Add translated request only flag for pci_enable_pasid() >>>> >>>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring >>>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI: >>>> Enable PASID only when ACS RR & UF enabled on upstream path") requires >>>> some ACS features being supported on device's upstream path when enabling >>>> PCI/PASID. > > Looking up 201007ef707a to see what ensuring system integrity means, > it prevents Memory Requests with PASID, which should always be routed > to the RC, from being mistakenly routed as peer-to-peer requests. Yes, exactly. I will update above paragraph like below The PCIe fabric routes Memory Requests based on the TLP address, ignoring the PASID. In order to prevent Memory Requests with PASID, which should always be routed to the RC, from being mistakenly routed as peer-to-peer requests, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") requires some ACS features being supported on device's upstream path when enabling PCI/PASID. > >>>> However, above change causes the Linux kernel boots to black screen on a >>>> system with below graphic device: >>> >>> We need a PCIe concept-level description of the issue first, i.e., in >>> terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue >>> as an instance. >> >> How about below description? > > Thanks, this is exactly the sort of thing I'm looking for. But my > understanding of ATS/PRI/PASID is weak, so I'm still working through > this. Tell me when I say something wrong below... > >> PCIe endpoints can use ATS to request DMA remapping hardware to >> translate an IOVA to its mapped physical address. If the translation is >> missing or the permissions are insufficient, the PRI is used to trigger >> an I/O page fault. The IOMMU driver will fill the mapping with desired >> permissions and return the translated address to the device. > > In PCIe spec language, I think you're saying that a PCIe Function may > contain an ATC. If the ATC Capability Enable bit is set, the Function > can issue Translation Requests. > > The TA (aka IOMMU) will respond with a Translation Completion. If the > Completion is a CplD, it contains the translated address and the > Function can store the entry in its ATC. I assume the I/O page fault > case corresponds to a Cpl (with no data) meaning that the TA could not > translate the address. > > If the TA doesn't have a mapping with the desired permissions, and the > Function's Page Request Capability Enable bit is set, it may issue a > Page Request Message. It's up to the TA/IOMMU to make this message > visible to the OS, which can make the page resident, create an IOMMU > mapping, and enable a PRG Response Message. After the Function > receives the PRG Response Message, it would issue another Translation > Request. > >> The translated address is specified by the IOMMU driver. The IOMMU >> driver ensures that the address is a DMA buffer address instead of any >> P2P address in the PCI fabric. Therefore, any translated memory request >> will eventually be routed to IOMMU regardless of whether there is ACS >> control in the up-streaming path. > > A Memory Request with an address that is not a P2P address, i.e., it > is not contained in any bridge aperture, will *always* be routed > toward the RC, won't it? Isn't that the case regardless of whether > the address is translated or untranslated, and even regardless of ACS? > > IIUC, ACS basically causes peer-to-peer requests to be routed upstream > instead of directly to the peer. > > OK, reading this again, I realize that I just restated exactly what > you had already written, sorry about that. Never mind. It's really a good chance for me to learn how to describe this from the perspective of the PCI subsystem.. :-) > >> AMD GPU is one of those devices. > > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? > And furthermore, that the GPU *always* uses Translated addresses with > PASID? > > So I guess what's going on here is that if: > > - A device only uses PASID with Translated addresses, and > - those Translated addresses are never P2P addresses, then > - those transactions will always be routed to the RC. > > And this applies even if there is no ACS or ACS doesn't support > PCI_ACS_RR and PCI_ACS_UF. Yes. > > The black screen happens because ... ? > > What can we include in the commit log to help people find this fix? I > see these in the bugzilla: > > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 > WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 > > (These look like defects in pdev_pri_ats_enable(), so really just > distractions) > > kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874 > kfd kfd: amdgpu: device 1002:9874 NOT added due to errors > BUG: kernel NULL pointer dereference, address: 0000000000000058 > RIP: 0010:report_iommu_fault+0x11/0x90 > > I couldn't figure out the NULL pointer dereference. I expected it to > be from a BUG() or similar in report_iommu_fault(), but I don't see > that. Above has been answered by Vasant Hegde in a separated reply. Best regards, baolu
On 2023/2/1 8:14, Bjorn Helgaas wrote: >>>> The AMD iommu driver allocates a new domain (called v2 domain) for the >>> "v2 domain" needs to be something greppable -- an identifier, >>> filename, etc. >> The code reads, >> >> 2052 if (iommu_feature(iommu, FEATURE_GT) && >> 2053 iommu_feature(iommu, FEATURE_PPR)) { >> 2054 iommu->is_iommu_v2 = true; >> >> So, how about >> >> ..The AMD GPU has a private interface to its own AMD IOMMU, which could >> be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu >> driver allocates a special domain for the GPU device .. > Where is this special domain allocated? I think the above tests for > *IOMMU* features (I assume "GTSup: Guest translations supported" and > "PPRSup: Peripheral page request support" based on the AMD IOMMU > spec). It doesn't test that this is a GPU. From the discussion, my understanding is that the IOMMU is a GPU dedicated IOMMU. The translated-request-only feature is enumerated through the IOMMU feature bits. However, I am not familiar with AMD architecture. AMD guys may have a better explanation. > This change doesn't feel safe for all possible devices that have a > PASID Capability because we don't know whether they*always* use > Translated addresses with PASID TLPs. I tend to think you are right. :-) Best regards, baolu
On Tue, 31 Jan 2023 22:36:27 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Jan 31, 2023 at 06:14:19PM -0600, Bjorn Helgaas wrote: > > > > AMD GPU is one of those devices. > > > > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities? > > And furthermore, that the GPU *always* uses Translated addresses with > > PASID? > > I'm not versed in the spec lingo, but the GPU issues MemRd/Wrs with > the translated bit set and no PASID header - which is the correct form > for an address that was translated by ATS. FWIW there is a capability bit and enable bit in the PASID cap/control registers that says whether a device can/should add a PASID to a translated request or not. I think the intent is that a host can sanity check AT requests to make sure the device isn't making them up. To do that it needs the PASID. Not sure any hosts do this yet though ;) Not worth much, but I thought it always sent the PASID so dug out spec to check (I was wrong as it is both optional and configurable). > > To get to that it issues ATS requests, and only the ATS related > requests will carry the PASID. > > ATS related requests always route to the root port, which is why it is > functionally equivalent to ACS RR/UF in these cases. > > Translated requests always route where they are supposed to go, even > with P2P and things. > > > And this applies even if there is no ACS or ACS doesn't support > > PCI_ACS_RR and PCI_ACS_UF. > > > > The black screen happens because ... ? > > AMD GPU driver bugs blow up if it cannot setup PASID. > > > I couldn't figure out the NULL pointer dereference. I expected it to > > be from a BUG() or similar in report_iommu_fault(), but I don't see > > that. > > IIRC it is a buggy error unwind handling in the AMD GPU driver. > > Jason
On Wed, Feb 01, 2023 at 02:31:04PM +0800, Baolu Lu wrote: > > This change doesn't feel safe for all possible devices that have a > > PASID Capability because we don't know whether they*always* use > > Translated addresses with PASID TLPs. > > I tend to think you are right. :-) This is why we discussed moving the pci_enable_pasid() into drivers and out of the core iommu code as a followup Jason
On Tue, Jan 31, 2023 at 08:25:05PM +0800, Baolu Lu wrote: > On 2023/1/31 2:38, Bjorn Helgaas wrote: > > PCIe r6.0, sec 6.20.1: > > > > A Function is not permitted to generate Requests using Translated > > Addresses and a PASID unless both PASID Enable and Translated > > Requests with PASID Enable are Set. > > > > You want AMD graphics devices to do DMA with translated addresses and > > PASID, right? pci_enable_pasid() sets PASID Enable > > (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests > > with PASID Enable" is set. We don't even have a #define for it. > > > > I would think we should check "Translated Requests with PASID > > Supported" before setting "Translated Requests with PASID Enable", > > too? > > This seems to be an ECN for PCIe 5.x: > > https://members.pcisig.com/wg/PCI-SIG/document/14929 > > What I read from this ECN is that, > > With this ECN, translated memory requests for PASIDs are not allowed to > carry a PASID prefix if "Translated Requests with PASID Enabled" is not > set. It does not mean whether the device can generate translated memory > requests for PASID, but whether the memory request can carry a PASID > prefix. My assumption that "you want AMD graphics devices to do DMA with translated addresses and PASID" was wrong. Per Jason [1], it sounds like the AMD GPU generates Translation Requests (sec 10.2.2) with a PASID. The GPU will cache the translated address from the Translation Completion in its local ATC, and will do DMA (MemRd/Wr) with that translated address but *without* PASID prefixes. That makes sense because (PASID, IOVA) maps to a translated address, e.g., a a CPU physical address, and the GPU can DMA to that address directly without needing the PASID. Bjorn [1] https://lore.kernel.org/r/Y9nQK9P3HOxEeZ4U@nvidia.com
On 2023/2/2 0:58, Bjorn Helgaas wrote: > On Tue, Jan 31, 2023 at 08:25:05PM +0800, Baolu Lu wrote: >> On 2023/1/31 2:38, Bjorn Helgaas wrote: >>> PCIe r6.0, sec 6.20.1: >>> >>> A Function is not permitted to generate Requests using Translated >>> Addresses and a PASID unless both PASID Enable and Translated >>> Requests with PASID Enable are Set. >>> >>> You want AMD graphics devices to do DMA with translated addresses and >>> PASID, right? pci_enable_pasid() sets PASID Enable >>> (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests >>> with PASID Enable" is set. We don't even have a #define for it. >>> >>> I would think we should check "Translated Requests with PASID >>> Supported" before setting "Translated Requests with PASID Enable", >>> too? >> This seems to be an ECN for PCIe 5.x: >> >> https://members.pcisig.com/wg/PCI-SIG/document/14929 >> >> What I read from this ECN is that, >> >> With this ECN, translated memory requests for PASIDs are not allowed to >> carry a PASID prefix if "Translated Requests with PASID Enabled" is not >> set. It does not mean whether the device can generate translated memory >> requests for PASID, but whether the memory request can carry a PASID >> prefix. > My assumption that "you want AMD graphics devices to do DMA with > translated addresses and PASID" was wrong. > > Per Jason [1], it sounds like the AMD GPU generates Translation > Requests (sec 10.2.2) with a PASID. The GPU will cache the translated > address from the Translation Completion in its local ATC, and will do > DMA (MemRd/Wr) with that translated address but*without* PASID > prefixes. > > That makes sense because (PASID, IOVA) maps to a translated address, > e.g., a a CPU physical address, and the GPU can DMA to that address > directly without needing the PASID. Hi Bjorn and Jason, According to the discussion so far, I refined the commit message like below. How does it look like? PCI: Add translated request only flag for pci_enable_pasid() The PCIe fabric routes Memory Requests based on the TLP address, ignoring the PASID. In order to prevent Memory Requests for PASID, which should always be routed to the RC, from being mistakenly routed as peer-to-peer requests, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") requires some ACS features being supported on device's upstream path when enabling PCI/PASID. However, above change causes the Linux kernel boots to black screen on a system with below graphic device: 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller]) DeviceName: ATI EG BROADWAY Subsystem: Hewlett-Packard Company Device 8332 The AMD iommu driver allocates a special domain for above device and enables its PCI PASID/ATS/PRI before attaching the domain to it. The failure of pci_enable_pasid() due to lack of ACS causes the domain to fail to be attached to the device. The GPU device is unable to DMA normally, resulting in a black screen of the system. The PCIe spec defines Address Translation Service in its chapter 10. A PCIe Function may contain an ATC. If the ATC Capability Enable bit is set, the Function can issue Translation Requests. The TA (aka IOMMU) will respond with a Translation Completion. If the Completion is a CplD, it contains the translated address and the Function can store the entry in its ATC. If the TA doesn't have a mapping with the desired permissions, and the Function's Page Request Capability Enable bit is set, it may issue a Page Request Message. It's up to the TA to make this message visible to the OS, which can make the page resident, create an IOMMU mapping, and respond with a PRG Response Message. After the Function receives the PRG Response Message, it would issue another Translation Request. The Function can then obtain a translated address and store the entry in its ATC. The AMD integrated GPU together with its dedicated IOMMU implements above functionality. It generates Translation Requests (sec 10.2.2) with a PASID and caches the translated address from the Translation Completion in its local ATC, and will do DMA (MemRd/Wr) with that translated address without PASID prefixes. This capability (using translated address for PASID memory requests) could be detected by software from AMD IOMMU feature bits (GTSup: Guest translations supported" and PPRSup: Peripheral page request support). ACS is unnecessary for the devices that only use translated memory request for PASID. All translated addresses are granted by the Linux kernel which ensures that such addresses will never be in a P2P address, i.e., it's not contained in any bridge aperture, will *always* be routed toward the RC. Add a flag for pci_enable_pasid(), with which the drivers could opt-in the fact that device always uses translated memory requests for PASID hence the ACS is not a necessity. Apply this opt-in for above AMD graphic device. At present, it is a common practice to enable/disable PCI PASID in the iommu drivers. Considering that the device driver knows more about the specific device, it is recommended to move pci_enable_pasid() into the specific device drivers. Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path") Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865 Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/ Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> -- Best regards, baolu
[Joerg, you may be able to answer this. Patch under discussion is: https://lore.kernel.org/r/20230114073420.759989-1-baolu.lu@linux.intel.com] On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote: > ... > ACS is unnecessary for the devices that only use translated memory request > for PASID. All translated addresses are granted by the Linux kernel which > ensures that such addresses will never be in a P2P address, i.e., it's not > contained in any bridge aperture, will *always* be routed toward the RC. Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path"), does that commit actually *fix* anything? I wonder whether we could revert it completely. The intent of 201007ef707a is to use ACS to prevent misrouting, which would happen if a TLP contained an address that *looked* like a PCI bus address, i.e., it was inside a host bridge aperture, but was *intended* to reach an IOMMU or main memory directly. 201007ef707a only affects pci_enable_pasid(), so I think we already avoid this misrouting by restricting DMA address allocation for both non-IOMMU scenarios and non-PASID IOMMU scenarios. So what about PASID mappings, e.g., consider a mapping of (Requester ID, PASID, Untranslated Address) -> Translated Address? If either the Untranslated Address or the Translated Address looks like a PCI bus address, a Memory Request or Translation Request could be misrouted. Does that actually happen? I assume it does not happen for Translated Addresses because that's basically the non-IOMMU case, and we don't need ACS to prevent misrouting there. Do IOMMUs allocate (PASID, Untranslated Addresses) that look like PCI bus addresses? Bjorn
On Thu, Feb 02, 2023 at 02:12:49PM -0600, Bjorn Helgaas wrote: > [Joerg, you may be able to answer this. Patch under discussion is: > https://lore.kernel.org/r/20230114073420.759989-1-baolu.lu@linux.intel.com] > > On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote: > > ... > > > ACS is unnecessary for the devices that only use translated memory request > > for PASID. All translated addresses are granted by the Linux kernel which > > ensures that such addresses will never be in a P2P address, i.e., it's not > > contained in any bridge aperture, will *always* be routed toward the RC. > > Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on > upstream path"), does that commit actually *fix* anything? I wonder > whether we could revert it completely. > > The intent of 201007ef707a is to use ACS to prevent misrouting, which > would happen if a TLP contained an address that *looked* like a PCI > bus address, i.e., it was inside a host bridge aperture, but was > *intended* to reach an IOMMU or main memory directly. Yes. > 201007ef707a only affects pci_enable_pasid(), so I think we already > avoid this misrouting by restricting DMA address allocation for both > non-IOMMU scenarios and non-PASID IOMMU scenarios. There is no restriction on DMA address allocation with PASID. The typical PASID use case is to point the PASID at the CPU page table and then all VA's are fair game by userspace. There is no carve out like the DMA API has to protect from bus address confusion. > So what about PASID mappings, e.g., consider a mapping of (Requester > ID, PASID, Untranslated Address) -> Translated Address? If either the > Untranslated Address or the Translated Address looks like a PCI bus > address, a Memory Request or Translation Request could be misrouted. The PCI rules are a bit complicated: - A simple MemRd/Wr with a PASID will be routed according to the address. This can be mis-routed - A ATS translation request with a PASID is always routed to the host bridge - A MemRd/Wr with Translated set and no PASID is always routed to the correct destination, even if that is not the host bridge > Do IOMMUs allocate (PASID, Untranslated Addresses) that look like PCI > bus addresses? Yes, because it is mapped to a mm_struct userspace can use any mmap to access any valid address as an IOVA and thus PASID tagged translation must never become confused with bus addresses. Further, and worse, the common use model for PASID SVA is for userspace to directly submit IOVA to the device for operation. If userspace can submit a hostile IOVA and cause DMA to reach something that is not the host bridge then system security is completely wrecked. So, as an API in Linux we felt it was best to only enable PASID if PASID is secure and truely isolated, otherwise leave PASID off. The use cases for insecure PASID seem small. Jason
[+cc Alex in case you're interested in the ACS angle] On Thu, Feb 02, 2023 at 04:45:05PM -0400, Jason Gunthorpe wrote: > On Thu, Feb 02, 2023 at 02:12:49PM -0600, Bjorn Helgaas wrote: > > On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote: > > > ... > > > > > ACS is unnecessary for the devices that only use translated > > > memory request for PASID. All translated addresses are granted > > > by the Linux kernel which ensures that such addresses will never > > > be in a P2P address, i.e., it's not contained in any bridge > > > aperture, will *always* be routed toward the RC. > > > > Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled > > on upstream path"), does that commit actually *fix* anything? I > > wonder whether we could revert it completely. > > > > The intent of 201007ef707a is to use ACS to prevent misrouting, > > which would happen if a TLP contained an address that *looked* > > like a PCI bus address, i.e., it was inside a host bridge > > aperture, but was *intended* to reach an IOMMU or main memory > > directly. > > Yes. > > > 201007ef707a only affects pci_enable_pasid(), so I think we > > already avoid this misrouting by restricting DMA address > > allocation for both non-IOMMU scenarios and non-PASID IOMMU > > scenarios. > > There is no restriction on DMA address allocation with PASID. > > The typical PASID use case is to point the PASID at the CPU page > table and then all VA's are fair game by userspace. There is no > carve out like the DMA API has to protect from bus address > confusion. I think you're saying that for (Requester ID, PASID, Untranslated Address), the Untranslated Address is not restricted at all, and it may look like a PCI bus address. > > So what about PASID mappings, e.g., consider a mapping of > > (Requester ID, PASID, Untranslated Address) -> Translated Address? > > If either the Untranslated Address or the Translated Address looks > > like a PCI bus address, a Memory Request or Translation Request > > could be misrouted. > > The PCI rules are a bit complicated: > - A simple MemRd/Wr with a PASID will be routed according to the > address. This can be mis-routed > - A ATS translation request with a PASID is always routed to the host > bridge From a PCIe point of view, I think these cases are equivalent because a PASID prefix doesn't affect routing (sec 2.2.10.4). A Translation Request includes an Untranslated Address, and if that happens to look like a PCI bus address, I think it will be mis-routed just like a MemRd/Wr would be. > - A MemRd/Wr with Translated set and no PASID is always routed to the > correct destination, even if that is not the host bridge I don't think Address Type 10b ("Translated") affects routing. A MemRd/Wr should be routed to a PCI peer if the Translated Address is inside a host bridge aperture, or to the host bridge otherwise. > > Do IOMMUs allocate (PASID, Untranslated Addresses) that look like > > PCI bus addresses? > > Yes, because it is mapped to a mm_struct userspace can use any mmap > to access any valid address as an IOVA and thus PASID tagged > translation must never become confused with bus addresses. If PCI bus addresses are carved out of the Translated Address arena, the MemRd/Wr TLPs should be fine, but I think the Translation Requests that include Untranslated Addresses are still a problem. > Further, and worse, the common use model for PASID SVA is for > userspace to directly submit IOVA to the device for operation. If > userspace can submit a hostile IOVA and cause DMA to reach something > that is not the host bridge then system security is completely > wrecked. > > So, as an API in Linux we felt it was best to only enable PASID if > PASID is secure and truely isolated, otherwise leave PASID off. The > use cases for insecure PASID seem small. The patch under discussion is intended to fix a v6.2-rc1 regression added by 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path"). Are we on track to fix this before v6.2? I don't have a clear understanding of how we know this change is safe and it only affects AMD GPU and not other devices below the same IOMMU. Bjorn
On Fri, Feb 03, 2023 at 12:20:45PM -0600, Bjorn Helgaas wrote: > > The PCI rules are a bit complicated: > > - A simple MemRd/Wr with a PASID will be routed according to the > > address. This can be mis-routed > > - A ATS translation request with a PASID is always routed to the host > > bridge > > From a PCIe point of view, I think these cases are equivalent because > a PASID prefix doesn't affect routing (sec 2.2.10.4). A Translation > Request includes an Untranslated Address, and if that happens to look > like a PCI bus address, I think it will be mis-routed just like a > MemRd/Wr would be. So, I trusted AMD when they said this is why their platform worked, I didn't check carefully the ATS spec language. But looking now, I agree. I don't see any language that says ATS is routed differently, if anything it says it is routed the same way as MemRd. AMD folks? If yes then this has all been the wrong direction. > > > Do IOMMUs allocate (PASID, Untranslated Addresses) that look like > > > PCI bus addresses? > > > > Yes, because it is mapped to a mm_struct userspace can use any mmap > > to access any valid address as an IOVA and thus PASID tagged > > translation must never become confused with bus addresses. > > If PCI bus addresses are carved out of the Translated Address arena, > the MemRd/Wr TLPs should be fine, Yes, that is my point that Translated requests are fine because they already carry a Translated address, and by its nature a Translated address cannot be mis-routed. The spec actually talks about this whole issue see the NOTE at the end of 10.1.1 Per that note, for Linux, the "implementation constraint on the TA" is that the TA requires ACS so that Untranslated always routes to the TA. The AMD defect is some BIOS's on some of their SOC's don't report ACS for the MFD, even though it does presumably implement ACS. > Are we on track to fix this before v6.2? AMD? Did the patches you were talking about to fix the oops in the AMD driver land? Can we rely on that to fix the black screen for v6.2? Otherwise I think the PCI core is correct here and I'm loath to change it to work around a GPU driver bug. The GPU driver should understand that PASID is not supported because the PCI core says so. It shouldn't crash and blackscreen. Keep in mind, from a PCI perspective, this protection is a security senstive check, described in the spec. If we drop it we expose platforms using SVA to a potential security issue upon mis-configuration. So, I'd much rather leave the PCI alone and see that the AMD driver is fixed. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Saturday, February 4, 2023 2:53 AM > > The AMD defect is some BIOS's on some of their SOC's don't report ACS > for the MFD, even though it does presumably implement ACS. > If that is the case looks AMD driver should implement a pci quirk so pci_dev_specific_acs_enabled() can correctly report to meet pci core requirement.
diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h index df54cd5b15db..750fca736ef4 100644 --- a/include/linux/pci-ats.h +++ b/include/linux/pci-ats.h @@ -4,6 +4,8 @@ #include <linux/pci.h> +#define PCI_PASID_XLATED_REQ_ONLY BIT(0) + #ifdef CONFIG_PCI_ATS /* Address Translation Service */ bool pci_ats_supported(struct pci_dev *dev); @@ -35,12 +37,12 @@ static inline bool pci_pri_supported(struct pci_dev *pdev) #endif /* CONFIG_PCI_PRI */ #ifdef CONFIG_PCI_PASID -int pci_enable_pasid(struct pci_dev *pdev, int features); +int pci_enable_pasid(struct pci_dev *pdev, int features, int flags); void pci_disable_pasid(struct pci_dev *pdev); int pci_pasid_features(struct pci_dev *pdev); int pci_max_pasids(struct pci_dev *pdev); #else /* CONFIG_PCI_PASID */ -static inline int pci_enable_pasid(struct pci_dev *pdev, int features) +static inline int pci_enable_pasid(struct pci_dev *pdev, int features, int flags) { return -EINVAL; } static inline void pci_disable_pasid(struct pci_dev *pdev) { } static inline int pci_pasid_features(struct pci_dev *pdev) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index cbeaab55c0db..64a8c03d7dfa 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1700,7 +1700,7 @@ static int pdev_pri_ats_enable(struct pci_dev *pdev) int ret; /* Only allow access to user-accessible pages */ - ret = pci_enable_pasid(pdev, 0); + ret = pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY); if (ret) goto out_err; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index ab160198edd6..891bf53c45dc 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2350,7 +2350,7 @@ static int arm_smmu_enable_pasid(struct arm_smmu_master *master) if (num_pasids <= 0) return num_pasids; - ret = pci_enable_pasid(pdev, features); + ret = pci_enable_pasid(pdev, features, 0); if (ret) { dev_err(&pdev->dev, "Failed to enable PASID\n"); return ret; diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 59df7e42fd53..5cc13f02a5ac 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1425,7 +1425,8 @@ static void iommu_enable_pci_caps(struct device_domain_info *info) undefined. So always enable PASID support on devices which have it, even if we can't yet know if we're ever going to use it. */ - if (info->pasid_supported && !pci_enable_pasid(pdev, info->pasid_supported & ~1)) + if (info->pasid_supported && + !pci_enable_pasid(pdev, info->pasid_supported & ~1, 0)) info->pasid_enabled = 1; if (info->pri_supported && diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c index f9cc2e10b676..06168415d6d7 100644 --- a/drivers/pci/ats.c +++ b/drivers/pci/ats.c @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev) * pci_enable_pasid - Enable the PASID capability * @pdev: PCI device structure * @features: Features to enable + * @flags: device-specific flags + * - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type + * for all PASID memory requests. * * Returns 0 on success, negative value on error. This function checks * whether the features are actually supported by the device and returns * an error if not. */ -int pci_enable_pasid(struct pci_dev *pdev, int features) +int pci_enable_pasid(struct pci_dev *pdev, int features, int flags) { u16 control, supported; int pasid = pdev->pasid_cap; @@ -382,7 +385,8 @@ int pci_enable_pasid(struct pci_dev *pdev, int features) if (!pasid) return -EINVAL; - if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF)) + if (!(flags & PCI_PASID_XLATED_REQ_ONLY) && + !pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF)) return -EINVAL; pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);