Message ID | 20230921075138.124099-9-yi.l.liu@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp5385524vqi; Fri, 22 Sep 2023 00:36:42 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFP3I83xMwjcg2IFlPYv438v8viBwgoQfwDlmSFuNXIY+O9GRWzHZqpG5qV97E9WASpwj4v X-Received: by 2002:a05:6870:51c6:b0:1d6:3b76:aae1 with SMTP id b6-20020a05687051c600b001d63b76aae1mr8140532oaj.39.1695368202553; Fri, 22 Sep 2023 00:36:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695368202; cv=none; d=google.com; s=arc-20160816; b=KbTT5n7Bk/l+OxC8M8YU75UvEZJz9RHA3scwEdtiRD99e6mG571hZBGOgyurWVEcjs EBmqXFioY8zXqKhTzCOcIgVOBz+I7UpEuw9GflbusPjbpo8ft9HKfGSey09KFHpWgTPA 47n2R1WVg2N4whqQeZpHecNmcZUojZSg4XcyyvcR75MD/T68lgv8HygHU2IjhsMlm/4Z 2LOBICpriRMQNpSyScIf3yzbAkNu0mY+ZboRWC39yu2RcPRIDOAU1sykEKFPuO1M4UW6 7gblaOgQpUTD3f4JpjEhYJgdkxvn3m17FnBt08Bl4BpNN9lxqsuwOyU/Y+0sTSU50Vbn qkYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=R+nexHPrSjMqZbu+W8Bsry4y++E1vFzwIQ5Ve/gir9w=; fh=ncJBVmsnOSqrX1O37yfYEzicwaA2e7ARxnsU7aiysyE=; b=DdJ4k/dwEvtIGg/Yr7OHmuD7nmblO4CdfO+WvwSa8whs2D6eBBKAdWixh/eIPDr7y6 9AoiuOq8jU0MrVntN0xj3MRRUMq47ZR+rgH8cBaczPC6P9nPWoUZw6a0PFEmG/eJuGr6 k3s00MCA4en5YPrf1ZV8L8WITDYY1Y8ehaPpzXbkgLCMXo9VSMo+BtLqDjmb4l6LBHr1 33rL721K56yMAoD+9bU9kcu0lpC1obVn2bz0vp3N8NDeDPTGmLpyfSZ/38hA2UcPLZbv es3jkVEMblQgmlVktZXscT0E9baPZpxbtWaCYj7BOFS14mKA4mK7uaNIDGCjETR26m6X G2vg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=Bqe5phMe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from snail.vger.email (snail.vger.email. [2620:137:e000::3:7]) by mx.google.com with ESMTPS id w20-20020a63fb54000000b00578b4edf57esi3111780pgj.797.2023.09.22.00.36.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Sep 2023 00:36:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=Bqe5phMe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 8375F82F9256; Thu, 21 Sep 2023 13:28:18 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231945AbjIUU2H (ORCPT <rfc822;pwkd43@gmail.com> + 29 others); Thu, 21 Sep 2023 16:28:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231814AbjIUU1b (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 21 Sep 2023 16:27:31 -0400 Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A412A5102B; Thu, 21 Sep 2023 10:49:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1695318592; x=1726854592; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QCYSdfSXFxQ7WaALFaYm4jsUT2kSUGMBFqo7HP5RmjU=; b=Bqe5phMeghF+N5rNuPdtcmuSre15gxLq1eYXVP8k6t2D1hg2VGw+C/Od FIfujusl9xw7Fl/UTsnENWakjPGaVN+UePGAdtzaP3Vl36E3THsuQ9O1v r+/EadOqnwIDI6VF8SFW/xE8DQJs9UxkgCPpYo5wS/j81LjbiJMjsQsV3 ZW5qVT4TPWLF4uleaG81PnZnBfUWgb6I1QXIgRr0vWLsDJPaP3vUcT8qu 8onPbopz0fK9FmEpQNArzrks3UzInlnV3CHKh7sT/TaBEqIGQqlXvPw3q XkxT5wDo63Uqdjyw7m9rSPCqWCdoygy9aBglkK+RqMyWIpUIyZ+NYDI+i Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10839"; a="359832885" X-IronPort-AV: E=Sophos;i="6.03,164,1694761200"; d="scan'208";a="359832885" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 21 Sep 2023 00:52:13 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10839"; a="723649546" X-IronPort-AV: E=Sophos;i="6.03,164,1694761200"; d="scan'208";a="723649546" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by orsmga006.jf.intel.com with ESMTP; 21 Sep 2023 00:52:13 -0700 From: Yi Liu <yi.l.liu@intel.com> To: joro@8bytes.org, alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com, robin.murphy@arm.com, baolu.lu@linux.intel.com Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com, yi.l.liu@intel.com, yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com, shameerali.kolothum.thodi@huawei.com, lulu@redhat.com, suravee.suthikulpanit@amd.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, zhenzhong.duan@intel.com, joao.m.martins@oracle.com Subject: [PATCH v4 08/17] iommufd: Always setup MSI and anforce cc on kernel-managed domains Date: Thu, 21 Sep 2023 00:51:29 -0700 Message-Id: <20230921075138.124099-9-yi.l.liu@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230921075138.124099-1-yi.l.liu@intel.com> References: <20230921075138.124099-1-yi.l.liu@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_NONE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Thu, 21 Sep 2023 13:28:18 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1777722408518892688 X-GMAIL-MSGID: 1777722408518892688 |
Series |
iommufd: Add nesting infrastructure
|
|
Commit Message
Yi Liu
Sept. 21, 2023, 7:51 a.m. UTC
From: Nicolin Chen <nicolinc@nvidia.com> Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt things. So, they should be only setup on kernel-managed domains. If the attaching domain is a user-managed domain, redirect the hwpt to hwpt->parent to do it correctly. Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Co-developed-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/iommu/iommufd/device.c | 4 ++++ drivers/iommu/iommufd/hw_pagetable.c | 4 ++++ 2 files changed, 8 insertions(+)
Comments
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Thursday, September 21, 2023 3:51 PM > > From: Nicolin Chen <nicolinc@nvidia.com> > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt > things. > So, they should be only setup on kernel-managed domains. If the attaching > domain is a user-managed domain, redirect the hwpt to hwpt->parent to do > it correctly. > No redirection. The parent should already have the configuration done when it's created. It shouldn't be triggered in the nesting path.
On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote: > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Thursday, September 21, 2023 3:51 PM > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt > > things. > > So, they should be only setup on kernel-managed domains. If the attaching > > domain is a user-managed domain, redirect the hwpt to hwpt->parent to do > > it correctly. > > > > No redirection. The parent should already have the configuration done > when it's created. It shouldn't be triggered in the nesting path. iommufd_hw_pagetable_enforce_cc() is not only called in alloc(), but also in hwpt_attach/replace() if cc is not enforced by the alloc() because the idev that initiates the hwpt_alloc() might not have idev->enforce_cache_coherency. Only when another idev that has idev->enforce_cache_coherency attaches to the shared hwpt, the cc configuration would be done. In a nesting case encountering the same situation above, the S2 hwpt is allocated without the iommufd_hw_pagetable_enforce_cc(). But the 2nd idev that has idev->enforce_cache_coherency might attach directly to the S1 domain/hwpt without touching the S2 domain (for the same VM, S2 domain can be shared). In this case, without a redirection, the iommufd_hw_pagetable_enforce_cc() would be missed. Any thought? Nic
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Saturday, October 14, 2023 8:45 AM > > On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote: > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > Sent: Thursday, September 21, 2023 3:51 PM > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt > > > things. > > > So, they should be only setup on kernel-managed domains. If the > attaching > > > domain is a user-managed domain, redirect the hwpt to hwpt->parent to > do > > > it correctly. > > > > > > > No redirection. The parent should already have the configuration done > > when it's created. It shouldn't be triggered in the nesting path. > > iommufd_hw_pagetable_enforce_cc() is not only called in alloc(), > but also in hwpt_attach/replace() if cc is not enforced by the > alloc() because the idev that initiates the hwpt_alloc() might > not have idev->enforce_cache_coherency. Only when another idev > that has idev->enforce_cache_coherency attaches to the shared > hwpt, the cc configuration would be done. > is this a bug already? If the 1st device doesn't have enforce_cc in its iommu, setting the snp bit in the hwpt would lead to reserved bit violation. another problem is that intel_iommu_enforce_cache_coherency() doesn't update existing entries. It only sets a domain flag to affect future mappings. so it means the 2nd idev is also broken. The simplest option is to follow vfio type1 i.e. don't mix devices with different enforce_cc in one domain.
On Mon, Oct 16, 2023 at 08:48:03AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Saturday, October 14, 2023 8:45 AM > > > > On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote: > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > Sent: Thursday, September 21, 2023 3:51 PM > > > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > > > > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt > > > > things. > > > > So, they should be only setup on kernel-managed domains. If the > > attaching > > > > domain is a user-managed domain, redirect the hwpt to hwpt->parent to > > do > > > > it correctly. > > > > > > > > > > No redirection. The parent should already have the configuration done > > > when it's created. It shouldn't be triggered in the nesting path. > > > > iommufd_hw_pagetable_enforce_cc() is not only called in alloc(), > > but also in hwpt_attach/replace() if cc is not enforced by the > > alloc() because the idev that initiates the hwpt_alloc() might > > not have idev->enforce_cache_coherency. Only when another idev > > that has idev->enforce_cache_coherency attaches to the shared > > hwpt, the cc configuration would be done. > > is this a bug already? If the 1st device doesn't have enforce_cc in its > iommu, setting the snp bit in the hwpt would lead to reserved > bit violation. I suspect there are technically some gaps in the intel driver, yes.. > another problem is that intel_iommu_enforce_cache_coherency() > doesn't update existing entries. It only sets a domain flag to affect > future mappings. so it means the 2nd idev is also broken. This is such a gap, intel driver should not permit that. > The simplest option is to follow vfio type1 i.e. don't mix devices > with different enforce_cc in one domain. This is why I wanted to get rid of this bad mechanism going forward. Manually created hwpt should have a manual specification of cc and then we don't have so many problems. It means userspace needs to compute if they want to use CC or not, but userspace already needs to figure this out since without autodomains it must create two hwpts manually anyhow. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Monday, October 16, 2023 7:58 PM > > On Mon, Oct 16, 2023 at 08:48:03AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Saturday, October 14, 2023 8:45 AM > > > > > > On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote: > > > > > From: Liu, Yi L <yi.l.liu@intel.com> > > > > > Sent: Thursday, September 21, 2023 3:51 PM > > > > > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > > > > > > > Now enforce_cache_coherency and msi_cookie are kernel-managed > hwpt > > > > > things. > > > > > So, they should be only setup on kernel-managed domains. If the > > > attaching > > > > > domain is a user-managed domain, redirect the hwpt to hwpt->parent > to > > > do > > > > > it correctly. > > > > > > > > > > > > > No redirection. The parent should already have the configuration done > > > > when it's created. It shouldn't be triggered in the nesting path. > > > > > > iommufd_hw_pagetable_enforce_cc() is not only called in alloc(), > > > but also in hwpt_attach/replace() if cc is not enforced by the > > > alloc() because the idev that initiates the hwpt_alloc() might > > > not have idev->enforce_cache_coherency. Only when another idev > > > that has idev->enforce_cache_coherency attaches to the shared > > > hwpt, the cc configuration would be done. > > > > is this a bug already? If the 1st device doesn't have enforce_cc in its > > iommu, setting the snp bit in the hwpt would lead to reserved > > bit violation. > > I suspect there are technically some gaps in the intel driver, yes.. double checked. intel driver is doing right thing now: intel_iommu_enforce_cache_coherency() domain_support_force_snooping() static bool domain_support_force_snooping(struct dmar_domain *domain) { struct device_domain_info *info; bool support = true; assert_spin_locked(&domain->lock); list_for_each_entry(info, &domain->devices, link) { if (!ecap_sc_support(info->iommu->ecap)) { support = false; break; } } return support; } > > > another problem is that intel_iommu_enforce_cache_coherency() > > doesn't update existing entries. It only sets a domain flag to affect > > future mappings. so it means the 2nd idev is also broken. > > This is such a gap, intel driver should not permit that. yes. @Baolu, can you add a fix? > > > The simplest option is to follow vfio type1 i.e. don't mix devices > > with different enforce_cc in one domain. > > This is why I wanted to get rid of this bad mechanism going forward. > > Manually created hwpt should have a manual specification of cc and > then we don't have so many problems. > > It means userspace needs to compute if they want to use CC or not, but > userspace already needs to figure this out since without autodomains > it must create two hwpts manually anyhow. > Now there is no interface reporting enforce_cc to userspace. Actually the user doesn't need to know enforce_cc. As long as there is an attach incompatibility error the user has to create a 2nd hwpt for the to-be-attached device then enforce_cc will be handled automatically in hwpt_alloc. I prefer to removing enforce_cc in attach fn completely then no parent trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to figure out the attaching compatibility: - attaching a idev with matching enforce_cc as hwpt is always allowed. - attaching a idev w/o enforce_cc to a hwpt with enforce_cc is rejected. - attaching a idev w/ enforce_cc to a hwpt w/o enforce_cc succeeds with hwpt continuing to be w/o enforce_cc. No promotion. above has been guaranteed by intel iommu driver.
> From: Tian, Kevin > Sent: Tuesday, October 17, 2023 4:53 PM > > > > This is why I wanted to get rid of this bad mechanism going forward. > > > > Manually created hwpt should have a manual specification of cc and > > then we don't have so many problems. > > Actually I don't see much difference between manual specification of cc vs. indirect specification of cc via specified idev in hwpt_alloc. What really matters is how we want to deal with the compatibility in the following attach path.
On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote: > > This is why I wanted to get rid of this bad mechanism going forward. > > > > Manually created hwpt should have a manual specification of cc and > > then we don't have so many problems. > > > > It means userspace needs to compute if they want to use CC or not, but > > userspace already needs to figure this out since without autodomains > > it must create two hwpts manually anyhow. > > > > Now there is no interface reporting enforce_cc to userspace. Yes.. > Actually the user doesn't need to know enforce_cc. As long as there is > an attach incompatibility error the user has to create a 2nd hwpt for > the to-be-attached device then enforce_cc will be handled automatically > in hwpt_alloc. Uh, OK we can do that.. It is a bit hacky because we don't know the order do > I prefer to removing enforce_cc in attach fn completely then no parent > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to > figure out the attaching compatibility: You are basically saying to set the cc mode during creation because we know the idev at that moment and can tell if it should be on/off? It seems reasonable, but I can't remember why it is in the attach path at the moment. Jason
On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote: > On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote: > > I prefer to removing enforce_cc in attach fn completely then no parent > > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to > > figure out the attaching compatibility: > > You are basically saying to set the cc mode during creation because we > know the idev at that moment and can tell if it should be on/off? > > It seems reasonable, but I can't remember why it is in the attach path > at the moment. This was the commit adding it in the alloc path: https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609-44eeda11b45e@arm.com/ The older code was doing a hwpt "upgrade" from !cc to cc: - /* - * Try to upgrade the domain we have, it is an iommu driver bug to - * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail - * enforce_cache_coherency when there are no devices attached to the - * domain. - */ - if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) { - if (hwpt->domain->ops->enforce_cache_coherency) - hwpt->enforce_cache_coherency = - hwpt->domain->ops->enforce_cache_coherency( - hwpt->domain); If we remove the enforce_cc call in the attach path and let the driver decide whether to enforce or reject in attach_dev calls, there seems to be no point in tracking an enforce_cache_coherency flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU extension check in the vfio-compat pathway? Thanks Nic
On 10/17/23 4:52 PM, Tian, Kevin wrote: >>> another problem is that intel_iommu_enforce_cache_coherency() >>> doesn't update existing entries. It only sets a domain flag to affect >>> future mappings. so it means the 2nd idev is also broken. >> This is such a gap, intel driver should not permit that. > yes. @Baolu, can you add a fix? Sure thing! Best regards, baolu
On Tue, Oct 17, 2023 at 12:58:39PM -0700, Nicolin Chen wrote: > On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote: > > On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote: > > > I prefer to removing enforce_cc in attach fn completely then no parent > > > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to > > > figure out the attaching compatibility: > > > > You are basically saying to set the cc mode during creation because we > > know the idev at that moment and can tell if it should be on/off? > > > > It seems reasonable, but I can't remember why it is in the attach path > > at the moment. > > This was the commit adding it in the alloc path: > https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609-44eeda11b45e@arm.com/ > > The older code was doing a hwpt "upgrade" from !cc to cc: > - /* > - * Try to upgrade the domain we have, it is an iommu driver bug to > - * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail > - * enforce_cache_coherency when there are no devices attached to the > - * domain. > - */ > - if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) { > - if (hwpt->domain->ops->enforce_cache_coherency) > - hwpt->enforce_cache_coherency = > - hwpt->domain->ops->enforce_cache_coherency( > - hwpt->domain); > > If we remove the enforce_cc call in the attach path and let the > driver decide whether to enforce or reject in attach_dev calls, > there seems to be no point in tracking an enforce_cache_coherency > flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU > extension check in the vfio-compat pathway? I think the purpose of this code is to try to minimize the number of domains. Otherwise we have a problem where the order devices are attached to the domain decides how many domains you get. ie the first device attached does not want CC (but is compatible with it) so we create a non-CC domain Then later we attach a device that does want CC and now we are forced to create a second iommu domain when upgrading the first domain would have been fine. Hotplug is another scenario (though Intel driver does not support it, and it looks broken) Really, I hate this CC mechanism. It is only for Intel, can we just punt it to userspace and have an intel 'want cc flag' for the entire nesting path and forget about it? Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, October 19, 2023 12:51 AM > > On Tue, Oct 17, 2023 at 12:58:39PM -0700, Nicolin Chen wrote: > > On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote: > > > On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote: > > > > I prefer to removing enforce_cc in attach fn completely then no parent > > > > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver > to > > > > figure out the attaching compatibility: > > > > > > You are basically saying to set the cc mode during creation because we > > > know the idev at that moment and can tell if it should be on/off? > > > > > > It seems reasonable, but I can't remember why it is in the attach path > > > at the moment. > > > > This was the commit adding it in the alloc path: > > https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609- > 44eeda11b45e@arm.com/ > > > > The older code was doing a hwpt "upgrade" from !cc to cc: > > - /* > > - * Try to upgrade the domain we have, it is an iommu driver bug to > > - * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail > > - * enforce_cache_coherency when there are no devices attached to > the > > - * domain. > > - */ > > - if (idev->enforce_cache_coherency && !hwpt- > >enforce_cache_coherency) { > > - if (hwpt->domain->ops->enforce_cache_coherency) > > - hwpt->enforce_cache_coherency = > > - hwpt->domain->ops->enforce_cache_coherency( > > - hwpt->domain); > > > > If we remove the enforce_cc call in the attach path and let the > > driver decide whether to enforce or reject in attach_dev calls, > > there seems to be no point in tracking an enforce_cache_coherency > > flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU > > extension check in the vfio-compat pathway? > > I think the purpose of this code is to try to minimize the number of > domains. > > Otherwise we have a problem where the order devices are attached to > the domain decides how many domains you get. ie the first device > attached does not want CC (but is compatible with it) so we create a > non-CC domain in autodetect model this won't happen. If the first device is capable of enforce_cc then the domain will be created with enforce_cc. there is no "does not want CC" input in autodetect. > > Then later we attach a device that does want CC and now we are forced > to create a second iommu domain when upgrading the first domain would > have been fine. then in this case the 2nd device will reuse the domain. > > Hotplug is another scenario (though Intel driver does not support it, > and it looks broken) Can you elaborate how hotplug is broken? If device is hotplugged and failed to attach to an existing domain, then a new one will be created for it. there is indeed a broken case in KVM which cannot handle dynamic change of coherency due to hotplug. But that one is orthogonal to what we discussed here about how to decide cc in iommufd. > > Really, I hate this CC mechanism. It is only for Intel, can we just Intel and AMD. > punt it to userspace and have an intel 'want cc flag' for the entire > nesting path and forget about it? > I dislike it too. But still not get your point why adding such a flag can really simplify things. As explained before the only difference between autodetect and having a user flag just affects the decision of cc when creating a hwpt. whether we should upgrade in the attach path is an orthogonal open which imho is unnecessary and Nicoline's patches to remove that check then also remove this patch makes lot of sense to me. Then the necessity of introducing such a flag (plus adding a new interface to report cc to user space) is not obvious.
On Thu, Oct 19, 2023 at 01:56:01AM +0000, Tian, Kevin wrote: > > Otherwise we have a problem where the order devices are attached to > > the domain decides how many domains you get. ie the first device > > attached does not want CC (but is compatible with it) so we create a > > non-CC domain > > in autodetect model this won't happen. If the first device is capable > of enforce_cc then the domain will be created with enforce_cc. > > there is no "does not want CC" input in autodetect. > > > > Then later we attach a device that does want CC and now we are forced > > to create a second iommu domain when upgrading the first domain would > > have been fine. > > then in this case the 2nd device will reuse the domain. Then you have the reverse problem where the domain will not be CC when it should be. > > Hotplug is another scenario (though Intel driver does not support it, > > and it looks broken) > > Can you elaborate how hotplug is broken? If device is hotplugged and > failed to attach to an existing domain, then a new one will be created > for it. A non-cc domain will ask to be upgraded and the driver will let it happen even though it doesn't/can't fix the existing IOPTEs > there is indeed a broken case in KVM which cannot handle dynamic > change of coherency due to hotplug. But that one is orthogonal to > what we discussed here about how to decide cc in iommufd. That too > > Really, I hate this CC mechanism. It is only for Intel, can we just > > Intel and AMD. Nope, AMD just hardwires their IOMMU to always do CC enforcing. All this mess is only for Intel and their weird IOMMU that can only do the enforcement for a GPU. > > punt it to userspace and have an intel 'want cc flag' for the entire > > nesting path and forget about it? > > I dislike it too. But still not get your point why adding such a flag > can really simplify things. As explained before the only difference > between autodetect and having a user flag just affects the decision > of cc when creating a hwpt. whether we should upgrade in the > attach path is an orthogonal open which imho is unnecessary and > Nicoline's patches to remove that check then also remove this > patch makes lot of sense to me. I don't think we can remove it, it is supposed to provide consistency of result regardless of ordering. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, October 20, 2023 7:54 AM > > On Thu, Oct 19, 2023 at 01:56:01AM +0000, Tian, Kevin wrote: > > > > Otherwise we have a problem where the order devices are attached to > > > the domain decides how many domains you get. ie the first device > > > attached does not want CC (but is compatible with it) so we create a > > > non-CC domain > > > > in autodetect model this won't happen. If the first device is capable > > of enforce_cc then the domain will be created with enforce_cc. > > > > there is no "does not want CC" input in autodetect. > > > > > > Then later we attach a device that does want CC and now we are forced > > > to create a second iommu domain when upgrading the first domain > would > > > have been fine. > > > > then in this case the 2nd device will reuse the domain. > > Then you have the reverse problem where the domain will not be CC when > it should be. If the domain has been non-CC it's perfectly fine for the 2nd device with CC to reuse it. As long as there is one domain with non-CC then KVM has to have special treatment on wbinvd. In this case there is actually a benefit to use just one non-CC domain for all devices (either non-CC or CC). What we want to prevent is attaching a non-CC device to a CC domain or upgrade a non-CC domain to CC since in both case the non-CC device will be broken due to incompatible page table format. > > > > Hotplug is another scenario (though Intel driver does not support it, > > > and it looks broken) > > > > Can you elaborate how hotplug is broken? If device is hotplugged and > > failed to attach to an existing domain, then a new one will be created > > for it. > > A non-cc domain will ask to be upgraded and the driver will let it > happen even though it doesn't/can't fix the existing IOPTEs iommufd should not ask for upgrade at all. The CC attribute of domain should be fixed since creation time. Baolu will fix the intel-iommu driver accordingly. > > > there is indeed a broken case in KVM which cannot handle dynamic > > change of coherency due to hotplug. But that one is orthogonal to > > what we discussed here about how to decide cc in iommufd. > > That too > > > > Really, I hate this CC mechanism. It is only for Intel, can we just > > > > Intel and AMD. > > Nope, AMD just hardwires their IOMMU to always do CC enforcing. All > this mess is only for Intel and their weird IOMMU that can only do the > enforcement for a GPU. > > > > punt it to userspace and have an intel 'want cc flag' for the entire > > > nesting path and forget about it? > > > > I dislike it too. But still not get your point why adding such a flag > > can really simplify things. As explained before the only difference > > between autodetect and having a user flag just affects the decision > > of cc when creating a hwpt. whether we should upgrade in the > > attach path is an orthogonal open which imho is unnecessary and > > Nicoline's patches to remove that check then also remove this > > patch makes lot of sense to me. > > I don't think we can remove it, it is supposed to provide consistency > of result regardless of ordering. > Who cares about such consistency? sure the result is different due to order: 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to dev2 (CC) will succeed; 2) creating hwpt for dev2 (CC) then later attaching hwpt to dev1 (non-CC) will fail then the user should create a new hwpt for dev1; But the user shouldn't assume such explicit consistency since it's not defined in our uAPI. All we defined is that the attaching may fail due to incompatibility for whatever reason then the user can always try creating a new hwpt for the to-be-attached device. From this regard I don't see providing consistency of result is necessary. 😊
On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > What we want to prevent is attaching a non-CC device to a CC domain > or upgrade a non-CC domain to CC since in both case the non-CC > device will be broken due to incompatible page table format. [..] > Who cares about such consistency? sure the result is different due to order: > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to > dev2 (CC) will succeed; > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to > dev1 (non-CC) will fail then the user should create a new hwpt > for dev1; AH... So really what the Intel driver wants is not upgrade to CC but *downgrade* from CC. non-CC is the type that is universally applicable, so if we come across a non-CC capable device the proper/optimal thing is to degrade the HWPT and re-use it, not allocate a new HWPT. So the whole thing is upside down. As changing the IOPTEs in flight seems hard, and I don't want to see the Intel driver get slowed down to accomodate this, I think you are right to say this should be a creation time property only. I still think userspace should be able to select it so it can minimize the number of HWPTs required. > But the user shouldn't assume such explicit consistency since it's not > defined in our uAPI. All we defined is that the attaching may > fail due to incompatibility for whatever reason then the user can > always try creating a new hwpt for the to-be-attached device. From > this regard I don't see providing consistency of result is > necessary. 😊 Anyhow, OK, lets add a comment summarizing your points and remove the cc upgrade at attach time (sorry Nicolin/Yi!) It is easy to add a HWPT flag for this later if someone wants to optimize it. Jason
On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > > > What we want to prevent is attaching a non-CC device to a CC domain > > or upgrade a non-CC domain to CC since in both case the non-CC > > device will be broken due to incompatible page table format. > > [..] > > > Who cares about such consistency? sure the result is different due to order: > > > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to > > dev2 (CC) will succeed; > > > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to > > dev1 (non-CC) will fail then the user should create a new hwpt > > for dev1; > > AH... So really what the Intel driver wants is not upgrade to CC but > *downgrade* from CC. > > non-CC is the type that is universally applicable, so if we come > across a non-CC capable device the proper/optimal thing is to degrade > the HWPT and re-use it, not allocate a new HWPT. > > So the whole thing is upside down. > > As changing the IOPTEs in flight seems hard, and I don't want to see > the Intel driver get slowed down to accomodate this, I think you are > right to say this should be a creation time property only. > > I still think userspace should be able to select it so it can minimize > the number of HWPTs required. > > > But the user shouldn't assume such explicit consistency since it's not > > defined in our uAPI. All we defined is that the attaching may > > fail due to incompatibility for whatever reason then the user can > > always try creating a new hwpt for the to-be-attached device. From > > this regard I don't see providing consistency of result is > > necessary. 😊 > > Anyhow, OK, lets add a comment summarizing your points and remove the > cc upgrade at attach time (sorry Nicolin/Yi!) Ack. I will send a small removal series. I assume it should CC stable tree also? And where should we add this comment? Kdoc of the alloc uAPI? Thanks! Nicolin > It is easy to add a HWPT flag for this later if someone wants to > optimize it. > > Jason
On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote: > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote: > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > > > > > What we want to prevent is attaching a non-CC device to a CC domain > > > or upgrade a non-CC domain to CC since in both case the non-CC > > > device will be broken due to incompatible page table format. > > > > [..] > > > > > Who cares about such consistency? sure the result is different due to order: > > > > > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to > > > dev2 (CC) will succeed; > > > > > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to > > > dev1 (non-CC) will fail then the user should create a new hwpt > > > for dev1; > > > > AH... So really what the Intel driver wants is not upgrade to CC but > > *downgrade* from CC. > > > > non-CC is the type that is universally applicable, so if we come > > across a non-CC capable device the proper/optimal thing is to degrade > > the HWPT and re-use it, not allocate a new HWPT. > > > > So the whole thing is upside down. > > > > As changing the IOPTEs in flight seems hard, and I don't want to see > > the Intel driver get slowed down to accomodate this, I think you are > > right to say this should be a creation time property only. > > > > I still think userspace should be able to select it so it can minimize > > the number of HWPTs required. > > > > > But the user shouldn't assume such explicit consistency since it's not > > > defined in our uAPI. All we defined is that the attaching may > > > fail due to incompatibility for whatever reason then the user can > > > always try creating a new hwpt for the to-be-attached device. From > > > this regard I don't see providing consistency of result is > > > necessary. 😊 > > > > Anyhow, OK, lets add a comment summarizing your points and remove the > > cc upgrade at attach time (sorry Nicolin/Yi!) > > Ack. I will send a small removal series. I assume it should CC > stable tree also? No, it seems more like tidying that fixing a functional issue, do I misunderstand? > And where should we add this comment? Kdoc of > the alloc uAPI? Maybe right in front of the only enforce_cc op callback? Jason
On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote: > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote: > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote: > > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > > > > > > > What we want to prevent is attaching a non-CC device to a CC domain > > > > or upgrade a non-CC domain to CC since in both case the non-CC > > > > device will be broken due to incompatible page table format. > > > > > > [..] > > > > > > > Who cares about such consistency? sure the result is different due to order: > > > > > > > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to > > > > dev2 (CC) will succeed; > > > > > > > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to > > > > dev1 (non-CC) will fail then the user should create a new hwpt > > > > for dev1; > > > > > > AH... So really what the Intel driver wants is not upgrade to CC but > > > *downgrade* from CC. > > > > > > non-CC is the type that is universally applicable, so if we come > > > across a non-CC capable device the proper/optimal thing is to degrade > > > the HWPT and re-use it, not allocate a new HWPT. > > > > > > So the whole thing is upside down. > > > > > > As changing the IOPTEs in flight seems hard, and I don't want to see > > > the Intel driver get slowed down to accomodate this, I think you are > > > right to say this should be a creation time property only. > > > > > > I still think userspace should be able to select it so it can minimize > > > the number of HWPTs required. > > > > > > > But the user shouldn't assume such explicit consistency since it's not > > > > defined in our uAPI. All we defined is that the attaching may > > > > fail due to incompatibility for whatever reason then the user can > > > > always try creating a new hwpt for the to-be-attached device. From > > > > this regard I don't see providing consistency of result is > > > > necessary. 😊 > > > > > > Anyhow, OK, lets add a comment summarizing your points and remove the > > > cc upgrade at attach time (sorry Nicolin/Yi!) > > > > Ack. I will send a small removal series. I assume it should CC > > stable tree also? > > No, it seems more like tidying that fixing a functional issue, do I > misunderstand? Hmm. Maybe the misunderstanding is mine -- Kevin was asking if it was already a bug and you answered yes: https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/ If this shouldn't be a bug fix, I could just merge them into a single tidying patch and add the comments you suggested below. > > And where should we add this comment? Kdoc of > > the alloc uAPI? > > Maybe right in front of the only enforce_cc op callback? OK. How about this? Might be a bit verbose though: /* * enforce_cache_coherenc must be determined during the HWPT allocation. * Note that a HWPT (non-CC) created for a device (non-CC) can be later * reused by another device (either non-CC or CC). However, A HWPT (CC) * created for a device (CC) cannot be reused by another device (non-CC) * but only devices (CC). Instead user space in this case would need to * allocate a separate HWPT (non-CC). */ Thanks Nic
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Monday, October 23, 2023 8:18 AM > > On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote: > > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote: > > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote: > > > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > > > > > > > > > But the user shouldn't assume such explicit consistency since it's not > > > > > defined in our uAPI. All we defined is that the attaching may > > > > > fail due to incompatibility for whatever reason then the user can > > > > > always try creating a new hwpt for the to-be-attached device. From > > > > > this regard I don't see providing consistency of result is > > > > > necessary. 😊 > > > > > > > > Anyhow, OK, lets add a comment summarizing your points and remove > the > > > > cc upgrade at attach time (sorry Nicolin/Yi!) > > > > > > Ack. I will send a small removal series. I assume it should CC > > > stable tree also? > > > > No, it seems more like tidying that fixing a functional issue, do I > > misunderstand? > > Hmm. Maybe the misunderstanding is mine -- Kevin was asking if > it was already a bug and you answered yes: > https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/ > currently intel-iommu driver already rejects 1) enforcing CC on a domain which is already attached to non-CC device and 2) attaching a non-CC device to a domain which has enforce_cc. so there is no explorable bug to fix in stable tree. Just logically intel-iommu driver doesn't support enforcing Cc on a domain which is capable of enforce-cc and already has valid mappings. Then it should add proper check to prevent it from being attempted by future usages.
On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote: > > > And where should we add this comment? Kdoc of > > > the alloc uAPI? > > > > Maybe right in front of the only enforce_cc op callback? > > OK. How about this? Might be a bit verbose though: > /* > * enforce_cache_coherenc must be determined during the HWPT allocation. > * Note that a HWPT (non-CC) created for a device (non-CC) can be later > * reused by another device (either non-CC or CC). However, A HWPT (CC) > * created for a device (CC) cannot be reused by another device (non-CC) > * but only devices (CC). Instead user space in this case would need to > * allocate a separate HWPT (non-CC). > */ Ok, fix the spello in enforce_cache_coherenc Jason
On Mon, Oct 23, 2023 at 02:53:20AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Monday, October 23, 2023 8:18 AM > > > > On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote: > > > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote: > > > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote: > > > > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote: > > > > > > > > > > > But the user shouldn't assume such explicit consistency since it's not > > > > > > defined in our uAPI. All we defined is that the attaching may > > > > > > fail due to incompatibility for whatever reason then the user can > > > > > > always try creating a new hwpt for the to-be-attached device. From > > > > > > this regard I don't see providing consistency of result is > > > > > > necessary. 😊 > > > > > > > > > > Anyhow, OK, lets add a comment summarizing your points and remove > > the > > > > > cc upgrade at attach time (sorry Nicolin/Yi!) > > > > > > > > Ack. I will send a small removal series. I assume it should CC > > > > stable tree also? > > > > > > No, it seems more like tidying that fixing a functional issue, do I > > > misunderstand? > > > > Hmm. Maybe the misunderstanding is mine -- Kevin was asking if > > it was already a bug and you answered yes: > > https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/ > > > > currently intel-iommu driver already rejects 1) enforcing CC on > a domain which is already attached to non-CC device and > 2) attaching a non-CC device to a domain which has enforce_cc. > > so there is no explorable bug to fix in stable tree. I see. Thanks!
On Mon, Oct 23, 2023 at 10:59:35AM -0300, Jason Gunthorpe wrote: > On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote: > > > > > And where should we add this comment? Kdoc of > > > > the alloc uAPI? > > > > > > Maybe right in front of the only enforce_cc op callback? > > > > OK. How about this? Might be a bit verbose though: > > /* > > * enforce_cache_coherenc must be determined during the HWPT allocation. > > * Note that a HWPT (non-CC) created for a device (non-CC) can be later > > * reused by another device (either non-CC or CC). However, A HWPT (CC) > > * created for a device (CC) cannot be reused by another device (non-CC) > > * but only devices (CC). Instead user space in this case would need to > > * allocate a separate HWPT (non-CC). > > */ > > Ok, fix the spello in enforce_cache_coherenc Oops. I also found the existing piece sorta says a similar thing: * Set the coherency mode before we do iopt_table_add_domain() as some * iommus have a per-PTE bit that controls it and need to decide before * doing any maps. So, did this and sending v3: - * enforce_cache_coherenc must be determined during the HWPT allocation. + * The cache coherency mode must be configured here and unchanged later.
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index eb120f70a3e3..104dd061a2a3 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -305,12 +305,16 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup, * domain after request_irq(). If it is not done interrupts will not * work on this domain. * + * Note: always set up a msi_cookie on a kernel-manage hw_pagetable. + * * FIXME: This is conceptually broken for iommufd since we want to allow * userspace to change the domains, eg switch from an identity IOAS to a * DMA IOAS. There is currently no way to create a MSI window that * matches what the IRQ layer actually expects in a newly created * domain. */ + if (hwpt->user_managed) + hwpt = hwpt->parent; if (sw_msi_start != PHYS_ADDR_MAX && !hwpt->msi_cookie) { rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start); if (rc) diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c index dc3e11a23acf..90fd65859e28 100644 --- a/drivers/iommu/iommufd/hw_pagetable.c +++ b/drivers/iommu/iommufd/hw_pagetable.c @@ -152,6 +152,10 @@ iommufd_user_managed_hwpt_alloc(struct iommufd_ctx *ictx, int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt) { + /* Always enforce cache coherency on a kernel-managed hw_pagetable */ + if (hwpt->user_managed) + hwpt = hwpt->parent; + if (hwpt->enforce_cache_coherency) return 0;