[v4,08/17] iommufd: Always setup MSI and anforce cc on kernel-managed domains

Message ID 20230921075138.124099-9-yi.l.liu@intel.com
State New
Headers
Series iommufd: Add nesting infrastructure |

Commit Message

Yi Liu Sept. 21, 2023, 7:51 a.m. UTC
  From: Nicolin Chen <nicolinc@nvidia.com>

Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt things.
So, they should be only setup on kernel-managed domains. If the attaching
domain is a user-managed domain, redirect the hwpt to hwpt->parent to do
it correctly.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c       | 4 ++++
 drivers/iommu/iommufd/hw_pagetable.c | 4 ++++
 2 files changed, 8 insertions(+)
  

Comments

Tian, Kevin Sept. 26, 2023, 8:16 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, September 21, 2023 3:51 PM
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt
> things.
> So, they should be only setup on kernel-managed domains. If the attaching
> domain is a user-managed domain, redirect the hwpt to hwpt->parent to do
> it correctly.
> 

No redirection. The parent should already have the configuration done
when it's created. It shouldn't be triggered in the nesting path.
  
Nicolin Chen Oct. 14, 2023, 12:44 a.m. UTC | #2
On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, September 21, 2023 3:51 PM
> >
> > From: Nicolin Chen <nicolinc@nvidia.com>
> >
> > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt
> > things.
> > So, they should be only setup on kernel-managed domains. If the attaching
> > domain is a user-managed domain, redirect the hwpt to hwpt->parent to do
> > it correctly.
> >
> 
> No redirection. The parent should already have the configuration done
> when it's created. It shouldn't be triggered in the nesting path.

iommufd_hw_pagetable_enforce_cc() is not only called in alloc(),
but also in hwpt_attach/replace() if cc is not enforced by the
alloc() because the idev that initiates the hwpt_alloc() might
not have idev->enforce_cache_coherency. Only when another idev
that has idev->enforce_cache_coherency attaches to the shared
hwpt, the cc configuration would be done.

In a nesting case encountering the same situation above, the S2
hwpt is allocated without the iommufd_hw_pagetable_enforce_cc().
But the 2nd idev that has idev->enforce_cache_coherency might
attach directly to the S1 domain/hwpt without touching the S2
domain (for the same VM, S2 domain can be shared). In this case,
without a redirection, the iommufd_hw_pagetable_enforce_cc()
would be missed.

Any thought?

Nic
  
Tian, Kevin Oct. 16, 2023, 8:48 a.m. UTC | #3
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, October 14, 2023 8:45 AM
> 
> On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Thursday, September 21, 2023 3:51 PM
> > >
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > >
> > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt
> > > things.
> > > So, they should be only setup on kernel-managed domains. If the
> attaching
> > > domain is a user-managed domain, redirect the hwpt to hwpt->parent to
> do
> > > it correctly.
> > >
> >
> > No redirection. The parent should already have the configuration done
> > when it's created. It shouldn't be triggered in the nesting path.
> 
> iommufd_hw_pagetable_enforce_cc() is not only called in alloc(),
> but also in hwpt_attach/replace() if cc is not enforced by the
> alloc() because the idev that initiates the hwpt_alloc() might
> not have idev->enforce_cache_coherency. Only when another idev
> that has idev->enforce_cache_coherency attaches to the shared
> hwpt, the cc configuration would be done.
> 

is this a bug already? If the 1st device doesn't have enforce_cc in its
iommu, setting the snp bit in the hwpt would lead to reserved
bit violation.

another problem is that intel_iommu_enforce_cache_coherency()
doesn't update existing entries. It only sets a domain flag to affect
future mappings. so it means the 2nd idev is also broken.

The simplest option is to follow vfio type1 i.e. don't mix devices
with different enforce_cc in one domain.
  
Jason Gunthorpe Oct. 16, 2023, 11:57 a.m. UTC | #4
On Mon, Oct 16, 2023 at 08:48:03AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, October 14, 2023 8:45 AM
> > 
> > On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Thursday, September 21, 2023 3:51 PM
> > > >
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > >
> > > > Now enforce_cache_coherency and msi_cookie are kernel-managed hwpt
> > > > things.
> > > > So, they should be only setup on kernel-managed domains. If the
> > attaching
> > > > domain is a user-managed domain, redirect the hwpt to hwpt->parent to
> > do
> > > > it correctly.
> > > >
> > >
> > > No redirection. The parent should already have the configuration done
> > > when it's created. It shouldn't be triggered in the nesting path.
> > 
> > iommufd_hw_pagetable_enforce_cc() is not only called in alloc(),
> > but also in hwpt_attach/replace() if cc is not enforced by the
> > alloc() because the idev that initiates the hwpt_alloc() might
> > not have idev->enforce_cache_coherency. Only when another idev
> > that has idev->enforce_cache_coherency attaches to the shared
> > hwpt, the cc configuration would be done.
> 
> is this a bug already? If the 1st device doesn't have enforce_cc in its
> iommu, setting the snp bit in the hwpt would lead to reserved
> bit violation.

I suspect there are technically some gaps in the intel driver, yes..
 
> another problem is that intel_iommu_enforce_cache_coherency()
> doesn't update existing entries. It only sets a domain flag to affect
> future mappings. so it means the 2nd idev is also broken.

This is such a gap, intel driver should not permit that.

> The simplest option is to follow vfio type1 i.e. don't mix devices
> with different enforce_cc in one domain.

This is why I wanted to get rid of this bad mechanism going forward.

Manually created hwpt should have a manual specification of cc and
then we don't have so many problems.

It means userspace needs to compute if they want to use CC or not, but
userspace already needs to figure this out since without autodomains
it must create two hwpts manually anyhow.

Jason
  
Tian, Kevin Oct. 17, 2023, 8:52 a.m. UTC | #5
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, October 16, 2023 7:58 PM
> 
> On Mon, Oct 16, 2023 at 08:48:03AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, October 14, 2023 8:45 AM
> > >
> > > On Tue, Sep 26, 2023 at 01:16:35AM -0700, Tian, Kevin wrote:
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Thursday, September 21, 2023 3:51 PM
> > > > >
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > >
> > > > > Now enforce_cache_coherency and msi_cookie are kernel-managed
> hwpt
> > > > > things.
> > > > > So, they should be only setup on kernel-managed domains. If the
> > > attaching
> > > > > domain is a user-managed domain, redirect the hwpt to hwpt->parent
> to
> > > do
> > > > > it correctly.
> > > > >
> > > >
> > > > No redirection. The parent should already have the configuration done
> > > > when it's created. It shouldn't be triggered in the nesting path.
> > >
> > > iommufd_hw_pagetable_enforce_cc() is not only called in alloc(),
> > > but also in hwpt_attach/replace() if cc is not enforced by the
> > > alloc() because the idev that initiates the hwpt_alloc() might
> > > not have idev->enforce_cache_coherency. Only when another idev
> > > that has idev->enforce_cache_coherency attaches to the shared
> > > hwpt, the cc configuration would be done.
> >
> > is this a bug already? If the 1st device doesn't have enforce_cc in its
> > iommu, setting the snp bit in the hwpt would lead to reserved
> > bit violation.
> 
> I suspect there are technically some gaps in the intel driver, yes..

double checked. intel driver is doing right thing now:

intel_iommu_enforce_cache_coherency()
  domain_support_force_snooping()

static bool domain_support_force_snooping(struct dmar_domain *domain)
{
	struct device_domain_info *info;
	bool support = true;

	assert_spin_locked(&domain->lock);
	list_for_each_entry(info, &domain->devices, link) {
		if (!ecap_sc_support(info->iommu->ecap)) {
			support = false;
			break;
		}
	}

	return support;
}

> 
> > another problem is that intel_iommu_enforce_cache_coherency()
> > doesn't update existing entries. It only sets a domain flag to affect
> > future mappings. so it means the 2nd idev is also broken.
> 
> This is such a gap, intel driver should not permit that.

yes. @Baolu, can you add a fix?

> 
> > The simplest option is to follow vfio type1 i.e. don't mix devices
> > with different enforce_cc in one domain.
> 
> This is why I wanted to get rid of this bad mechanism going forward.
> 
> Manually created hwpt should have a manual specification of cc and
> then we don't have so many problems.
> 
> It means userspace needs to compute if they want to use CC or not, but
> userspace already needs to figure this out since without autodomains
> it must create two hwpts manually anyhow.
> 

Now there is no interface reporting enforce_cc to userspace.

Actually the user doesn't need to know enforce_cc. As long as there is
an attach incompatibility error the user has to create a 2nd hwpt for
the to-be-attached device then enforce_cc will be handled automatically
in hwpt_alloc.

I prefer to removing enforce_cc in attach fn completely then no parent
trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to
figure out the attaching compatibility:

  - attaching a idev with matching enforce_cc as hwpt is always allowed.
  - attaching a idev w/o enforce_cc to a hwpt with enforce_cc is rejected.
  - attaching a idev w/ enforce_cc to a hwpt w/o enforce_cc succeeds with
    hwpt continuing to be w/o enforce_cc. No promotion.

above has been guaranteed by intel iommu driver.
  
Tian, Kevin Oct. 17, 2023, 9:05 a.m. UTC | #6
> From: Tian, Kevin
> Sent: Tuesday, October 17, 2023 4:53 PM
> >
> > This is why I wanted to get rid of this bad mechanism going forward.
> >
> > Manually created hwpt should have a manual specification of cc and
> > then we don't have so many problems.
> >

Actually I don't see much difference between manual specification of
cc vs. indirect specification of cc via specified idev in hwpt_alloc. What
really matters is how we want to deal with the compatibility in the
following attach path.
  
Jason Gunthorpe Oct. 17, 2023, 3:53 p.m. UTC | #7
On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote:

> > This is why I wanted to get rid of this bad mechanism going forward.
> > 
> > Manually created hwpt should have a manual specification of cc and
> > then we don't have so many problems.
> > 
> > It means userspace needs to compute if they want to use CC or not, but
> > userspace already needs to figure this out since without autodomains
> > it must create two hwpts manually anyhow.
> > 
> 
> Now there is no interface reporting enforce_cc to userspace.

Yes..

> Actually the user doesn't need to know enforce_cc. As long as there is
> an attach incompatibility error the user has to create a 2nd hwpt for
> the to-be-attached device then enforce_cc will be handled automatically
> in hwpt_alloc.

Uh, OK we can do that.. It is a bit hacky because we don't know the
order do
 
> I prefer to removing enforce_cc in attach fn completely then no parent
> trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to
> figure out the attaching compatibility:

You are basically saying to set the cc mode during creation because we
know the idev at that moment and can tell if it should be on/off?

It seems reasonable, but I can't remember why it is in the attach path
at the moment.

Jason
  
Nicolin Chen Oct. 17, 2023, 7:58 p.m. UTC | #8
On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote:
> > I prefer to removing enforce_cc in attach fn completely then no parent
> > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to
> > figure out the attaching compatibility:
> 
> You are basically saying to set the cc mode during creation because we
> know the idev at that moment and can tell if it should be on/off?
> 
> It seems reasonable, but I can't remember why it is in the attach path
> at the moment.

This was the commit adding it in the alloc path:
https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609-44eeda11b45e@arm.com/

The older code was doing a hwpt "upgrade" from !cc to cc:
-       /*
-        * Try to upgrade the domain we have, it is an iommu driver bug to
-        * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
-        * enforce_cache_coherency when there are no devices attached to the
-        * domain.
-        */
-       if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) {
-               if (hwpt->domain->ops->enforce_cache_coherency)
-                       hwpt->enforce_cache_coherency =
-                               hwpt->domain->ops->enforce_cache_coherency(
-                                       hwpt->domain);

If we remove the enforce_cc call in the attach path and let the
driver decide whether to enforce or reject in attach_dev calls,
there seems to be no point in tracking an enforce_cache_coherency
flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU
extension check in the vfio-compat pathway?

Thanks
Nic
  
Baolu Lu Oct. 18, 2023, 2:32 a.m. UTC | #9
On 10/17/23 4:52 PM, Tian, Kevin wrote:
>>> another problem is that intel_iommu_enforce_cache_coherency()
>>> doesn't update existing entries. It only sets a domain flag to affect
>>> future mappings. so it means the 2nd idev is also broken.
>> This is such a gap, intel driver should not permit that.
> yes. @Baolu, can you add a fix?

Sure thing!

Best regards,
baolu
  
Jason Gunthorpe Oct. 18, 2023, 4:51 p.m. UTC | #10
On Tue, Oct 17, 2023 at 12:58:39PM -0700, Nicolin Chen wrote:
> On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote:
> > > I prefer to removing enforce_cc in attach fn completely then no parent
> > > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver to
> > > figure out the attaching compatibility:
> > 
> > You are basically saying to set the cc mode during creation because we
> > know the idev at that moment and can tell if it should be on/off?
> > 
> > It seems reasonable, but I can't remember why it is in the attach path
> > at the moment.
> 
> This was the commit adding it in the alloc path:
> https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609-44eeda11b45e@arm.com/
> 
> The older code was doing a hwpt "upgrade" from !cc to cc:
> -       /*
> -        * Try to upgrade the domain we have, it is an iommu driver bug to
> -        * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
> -        * enforce_cache_coherency when there are no devices attached to the
> -        * domain.
> -        */
> -       if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) {
> -               if (hwpt->domain->ops->enforce_cache_coherency)
> -                       hwpt->enforce_cache_coherency =
> -                               hwpt->domain->ops->enforce_cache_coherency(
> -                                       hwpt->domain);
> 
> If we remove the enforce_cc call in the attach path and let the
> driver decide whether to enforce or reject in attach_dev calls,
> there seems to be no point in tracking an enforce_cache_coherency
> flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU
> extension check in the vfio-compat pathway?

I think the purpose of this code is to try to minimize the number of
domains.

Otherwise we have a problem where the order devices are attached to
the domain decides how many domains you get. ie the first device
attached does not want CC (but is compatible with it) so we create a
non-CC domain

Then later we attach a device that does want CC and now we are forced
to create a second iommu domain when upgrading the first domain would
have been fine.

Hotplug is another scenario (though Intel driver does not support it,
and it looks broken)

Really, I hate this CC mechanism. It is only for Intel, can we just
punt it to userspace and have an intel 'want cc flag' for the entire
nesting path and forget about it?

Jason
  
Tian, Kevin Oct. 19, 2023, 1:56 a.m. UTC | #11
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, October 19, 2023 12:51 AM
> 
> On Tue, Oct 17, 2023 at 12:58:39PM -0700, Nicolin Chen wrote:
> > On Tue, Oct 17, 2023 at 12:53:01PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 17, 2023 at 08:52:49AM +0000, Tian, Kevin wrote:
> > > > I prefer to removing enforce_cc in attach fn completely then no parent
> > > > trick in this patch. Just keep it in hwpt_alloc and leave to iommu driver
> to
> > > > figure out the attaching compatibility:
> > >
> > > You are basically saying to set the cc mode during creation because we
> > > know the idev at that moment and can tell if it should be on/off?
> > >
> > > It seems reasonable, but I can't remember why it is in the attach path
> > > at the moment.
> >
> > This was the commit adding it in the alloc path:
> > https://lore.kernel.org/linux-iommu/8e897628-61fa-b3fb-b609-
> 44eeda11b45e@arm.com/
> >
> > The older code was doing a hwpt "upgrade" from !cc to cc:
> > -       /*
> > -        * Try to upgrade the domain we have, it is an iommu driver bug to
> > -        * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
> > -        * enforce_cache_coherency when there are no devices attached to
> the
> > -        * domain.
> > -        */
> > -       if (idev->enforce_cache_coherency && !hwpt-
> >enforce_cache_coherency) {
> > -               if (hwpt->domain->ops->enforce_cache_coherency)
> > -                       hwpt->enforce_cache_coherency =
> > -                               hwpt->domain->ops->enforce_cache_coherency(
> > -                                       hwpt->domain);
> >
> > If we remove the enforce_cc call in the attach path and let the
> > driver decide whether to enforce or reject in attach_dev calls,
> > there seems to be no point in tracking an enforce_cache_coherency
> > flag in the IOMMUFD pathway but only for the VFIO_DMA_CC_IOMMU
> > extension check in the vfio-compat pathway?
> 
> I think the purpose of this code is to try to minimize the number of
> domains.
> 
> Otherwise we have a problem where the order devices are attached to
> the domain decides how many domains you get. ie the first device
> attached does not want CC (but is compatible with it) so we create a
> non-CC domain

in autodetect model this won't happen. If the first device is capable
of enforce_cc then the domain will be created with enforce_cc.

there is no "does not want CC" input in autodetect.

> 
> Then later we attach a device that does want CC and now we are forced
> to create a second iommu domain when upgrading the first domain would
> have been fine.

then in this case the 2nd device will reuse the domain.

> 
> Hotplug is another scenario (though Intel driver does not support it,
> and it looks broken)

Can you elaborate how hotplug is broken? If device is hotplugged and
failed to attach to an existing domain, then a new one will be created
for it.

there is indeed a broken case in KVM which cannot handle dynamic
change of coherency due to hotplug. But that one is orthogonal to
what we discussed here about how to decide cc in iommufd.

> 
> Really, I hate this CC mechanism. It is only for Intel, can we just

Intel and AMD.

> punt it to userspace and have an intel 'want cc flag' for the entire
> nesting path and forget about it?
> 

I dislike it too. But still not get your point why adding such a flag
can really simplify things. As explained before the only difference
between autodetect and having a user flag just affects the decision
of cc when creating a hwpt. whether we should upgrade in the
attach path is an orthogonal open which imho is unnecessary and
Nicoline's patches to remove that check then also remove this
patch makes lot of sense to me.

Then the necessity of introducing such a flag (plus adding a new
interface to report cc to user space) is not obvious.
  
Jason Gunthorpe Oct. 19, 2023, 11:53 p.m. UTC | #12
On Thu, Oct 19, 2023 at 01:56:01AM +0000, Tian, Kevin wrote:

> > Otherwise we have a problem where the order devices are attached to
> > the domain decides how many domains you get. ie the first device
> > attached does not want CC (but is compatible with it) so we create a
> > non-CC domain
> 
> in autodetect model this won't happen. If the first device is capable
> of enforce_cc then the domain will be created with enforce_cc.
> 
> there is no "does not want CC" input in autodetect.
> > 
> > Then later we attach a device that does want CC and now we are forced
> > to create a second iommu domain when upgrading the first domain would
> > have been fine.
> 
> then in this case the 2nd device will reuse the domain.

Then you have the reverse problem where the domain will not be CC when
it should be.

> > Hotplug is another scenario (though Intel driver does not support it,
> > and it looks broken)
> 
> Can you elaborate how hotplug is broken? If device is hotplugged and
> failed to attach to an existing domain, then a new one will be created
> for it.

A non-cc domain will ask to be upgraded and the driver will let it
happen even though it doesn't/can't fix the existing IOPTEs

> there is indeed a broken case in KVM which cannot handle dynamic
> change of coherency due to hotplug. But that one is orthogonal to
> what we discussed here about how to decide cc in iommufd.

That too
  
> > Really, I hate this CC mechanism. It is only for Intel, can we just
> 
> Intel and AMD.

Nope, AMD just hardwires their IOMMU to always do CC enforcing. All
this mess is only for Intel and their weird IOMMU that can only do the
enforcement for a GPU.

> > punt it to userspace and have an intel 'want cc flag' for the entire
> > nesting path and forget about it?
> 
> I dislike it too. But still not get your point why adding such a flag
> can really simplify things. As explained before the only difference
> between autodetect and having a user flag just affects the decision
> of cc when creating a hwpt. whether we should upgrade in the
> attach path is an orthogonal open which imho is unnecessary and
> Nicoline's patches to remove that check then also remove this
> patch makes lot of sense to me.

I don't think we can remove it, it is supposed to provide consistency
of result regardless of ordering.

Jason
  
Tian, Kevin Oct. 20, 2023, 2:43 a.m. UTC | #13
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, October 20, 2023 7:54 AM
> 
> On Thu, Oct 19, 2023 at 01:56:01AM +0000, Tian, Kevin wrote:
> 
> > > Otherwise we have a problem where the order devices are attached to
> > > the domain decides how many domains you get. ie the first device
> > > attached does not want CC (but is compatible with it) so we create a
> > > non-CC domain
> >
> > in autodetect model this won't happen. If the first device is capable
> > of enforce_cc then the domain will be created with enforce_cc.
> >
> > there is no "does not want CC" input in autodetect.
> > >
> > > Then later we attach a device that does want CC and now we are forced
> > > to create a second iommu domain when upgrading the first domain
> would
> > > have been fine.
> >
> > then in this case the 2nd device will reuse the domain.
> 
> Then you have the reverse problem where the domain will not be CC when
> it should be.

If the domain has been non-CC it's perfectly fine for the 2nd device with CC
to reuse it. As long as there is one domain with non-CC then KVM has to
have special treatment on wbinvd. In this case there is actually a benefit
to use just one non-CC domain for all devices (either non-CC or CC).

What we want to prevent is attaching a non-CC device to a CC domain
or upgrade a non-CC domain  to CC since in both case the non-CC device
will be broken due to incompatible page table format.

> 
> > > Hotplug is another scenario (though Intel driver does not support it,
> > > and it looks broken)
> >
> > Can you elaborate how hotplug is broken? If device is hotplugged and
> > failed to attach to an existing domain, then a new one will be created
> > for it.
> 
> A non-cc domain will ask to be upgraded and the driver will let it
> happen even though it doesn't/can't fix the existing IOPTEs

iommufd should not ask for upgrade at all. The CC attribute of domain
should be fixed since creation time.

Baolu will fix the intel-iommu driver accordingly.

> 
> > there is indeed a broken case in KVM which cannot handle dynamic
> > change of coherency due to hotplug. But that one is orthogonal to
> > what we discussed here about how to decide cc in iommufd.
> 
> That too
> 
> > > Really, I hate this CC mechanism. It is only for Intel, can we just
> >
> > Intel and AMD.
> 
> Nope, AMD just hardwires their IOMMU to always do CC enforcing. All
> this mess is only for Intel and their weird IOMMU that can only do the
> enforcement for a GPU.
> 
> > > punt it to userspace and have an intel 'want cc flag' for the entire
> > > nesting path and forget about it?
> >
> > I dislike it too. But still not get your point why adding such a flag
> > can really simplify things. As explained before the only difference
> > between autodetect and having a user flag just affects the decision
> > of cc when creating a hwpt. whether we should upgrade in the
> > attach path is an orthogonal open which imho is unnecessary and
> > Nicoline's patches to remove that check then also remove this
> > patch makes lot of sense to me.
> 
> I don't think we can remove it, it is supposed to provide consistency
> of result regardless of ordering.
> 

Who cares about such consistency? sure the result is different due to order:

1) creating hwpt for dev1 (non-CC) then later attaching hwpt to
    dev2 (CC) will succeed;

2) creating hwpt for dev2 (CC) then later attaching hwpt to
    dev1 (non-CC) will fail then the user should create a new hwpt
    for dev1;

But the user shouldn't assume such explicit consistency since it's not
defined in our uAPI. All we defined is that the attaching may
fail due to incompatibility for whatever reason then the user can
always try creating a new hwpt for the to-be-attached device. From
this regard I don't see providing consistency of result is necessary. 😊
  
Jason Gunthorpe Oct. 20, 2023, 1:55 p.m. UTC | #14
On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:

> What we want to prevent is attaching a non-CC device to a CC domain
> or upgrade a non-CC domain to CC since in both case the non-CC
> device will be broken due to incompatible page table format.

[..]

> Who cares about such consistency? sure the result is different due to order:
> 
> 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to
>     dev2 (CC) will succeed;
> 
> 2) creating hwpt for dev2 (CC) then later attaching hwpt to
>     dev1 (non-CC) will fail then the user should create a new hwpt
>     for dev1;

AH... So really what the Intel driver wants is not upgrade to CC but
*downgrade* from CC.

non-CC is the type that is universally applicable, so if we come
across a non-CC capable device the proper/optimal thing is to degrade
the HWPT and re-use it, not allocate a new HWPT.

So the whole thing is upside down.

As changing the IOPTEs in flight seems hard, and I don't want to see
the Intel driver get slowed down to accomodate this, I think you are
right to say this should be a creation time property only.

I still think userspace should be able to select it so it can minimize
the number of HWPTs required.

> But the user shouldn't assume such explicit consistency since it's not
> defined in our uAPI. All we defined is that the attaching may
> fail due to incompatibility for whatever reason then the user can
> always try creating a new hwpt for the to-be-attached device. From
> this regard I don't see providing consistency of result is
> necessary. 😊

Anyhow, OK, lets add a comment summarizing your points and remove the
cc upgrade at attach time (sorry Nicolin/Yi!)

It is easy to add a HWPT flag for this later if someone wants to
optimize it.

Jason
  
Nicolin Chen Oct. 20, 2023, 6:59 p.m. UTC | #15
On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:
> 
> > What we want to prevent is attaching a non-CC device to a CC domain
> > or upgrade a non-CC domain to CC since in both case the non-CC
> > device will be broken due to incompatible page table format.
> 
> [..]
> 
> > Who cares about such consistency? sure the result is different due to order:
> > 
> > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to
> >     dev2 (CC) will succeed;
> > 
> > 2) creating hwpt for dev2 (CC) then later attaching hwpt to
> >     dev1 (non-CC) will fail then the user should create a new hwpt
> >     for dev1;
> 
> AH... So really what the Intel driver wants is not upgrade to CC but
> *downgrade* from CC.
> 
> non-CC is the type that is universally applicable, so if we come
> across a non-CC capable device the proper/optimal thing is to degrade
> the HWPT and re-use it, not allocate a new HWPT.
> 
> So the whole thing is upside down.
> 
> As changing the IOPTEs in flight seems hard, and I don't want to see
> the Intel driver get slowed down to accomodate this, I think you are
> right to say this should be a creation time property only.
> 
> I still think userspace should be able to select it so it can minimize
> the number of HWPTs required.
> 
> > But the user shouldn't assume such explicit consistency since it's not
> > defined in our uAPI. All we defined is that the attaching may
> > fail due to incompatibility for whatever reason then the user can
> > always try creating a new hwpt for the to-be-attached device. From
> > this regard I don't see providing consistency of result is
> > necessary. 😊
> 
> Anyhow, OK, lets add a comment summarizing your points and remove the
> cc upgrade at attach time (sorry Nicolin/Yi!)

Ack. I will send a small removal series. I assume it should CC
stable tree also? And where should we add this comment? Kdoc of
the alloc uAPI?

Thanks!
Nicolin

> It is easy to add a HWPT flag for this later if someone wants to
> optimize it.
> 
> Jason
  
Jason Gunthorpe Oct. 21, 2023, 4:38 p.m. UTC | #16
On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote:
> On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:
> > 
> > > What we want to prevent is attaching a non-CC device to a CC domain
> > > or upgrade a non-CC domain to CC since in both case the non-CC
> > > device will be broken due to incompatible page table format.
> > 
> > [..]
> > 
> > > Who cares about such consistency? sure the result is different due to order:
> > > 
> > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to
> > >     dev2 (CC) will succeed;
> > > 
> > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to
> > >     dev1 (non-CC) will fail then the user should create a new hwpt
> > >     for dev1;
> > 
> > AH... So really what the Intel driver wants is not upgrade to CC but
> > *downgrade* from CC.
> > 
> > non-CC is the type that is universally applicable, so if we come
> > across a non-CC capable device the proper/optimal thing is to degrade
> > the HWPT and re-use it, not allocate a new HWPT.
> > 
> > So the whole thing is upside down.
> > 
> > As changing the IOPTEs in flight seems hard, and I don't want to see
> > the Intel driver get slowed down to accomodate this, I think you are
> > right to say this should be a creation time property only.
> > 
> > I still think userspace should be able to select it so it can minimize
> > the number of HWPTs required.
> > 
> > > But the user shouldn't assume such explicit consistency since it's not
> > > defined in our uAPI. All we defined is that the attaching may
> > > fail due to incompatibility for whatever reason then the user can
> > > always try creating a new hwpt for the to-be-attached device. From
> > > this regard I don't see providing consistency of result is
> > > necessary. 😊
> > 
> > Anyhow, OK, lets add a comment summarizing your points and remove the
> > cc upgrade at attach time (sorry Nicolin/Yi!)
> 
> Ack. I will send a small removal series. I assume it should CC
> stable tree also? 

No, it seems more like tidying that fixing a functional issue, do I
misunderstand?

> And where should we add this comment? Kdoc of
> the alloc uAPI?

Maybe right in front of the only enforce_cc op callback?

Jason
  
Nicolin Chen Oct. 23, 2023, 12:18 a.m. UTC | #17
On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote:
> > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:
> > > 
> > > > What we want to prevent is attaching a non-CC device to a CC domain
> > > > or upgrade a non-CC domain to CC since in both case the non-CC
> > > > device will be broken due to incompatible page table format.
> > > 
> > > [..]
> > > 
> > > > Who cares about such consistency? sure the result is different due to order:
> > > > 
> > > > 1) creating hwpt for dev1 (non-CC) then later attaching hwpt to
> > > >     dev2 (CC) will succeed;
> > > > 
> > > > 2) creating hwpt for dev2 (CC) then later attaching hwpt to
> > > >     dev1 (non-CC) will fail then the user should create a new hwpt
> > > >     for dev1;
> > > 
> > > AH... So really what the Intel driver wants is not upgrade to CC but
> > > *downgrade* from CC.
> > > 
> > > non-CC is the type that is universally applicable, so if we come
> > > across a non-CC capable device the proper/optimal thing is to degrade
> > > the HWPT and re-use it, not allocate a new HWPT.
> > > 
> > > So the whole thing is upside down.
> > > 
> > > As changing the IOPTEs in flight seems hard, and I don't want to see
> > > the Intel driver get slowed down to accomodate this, I think you are
> > > right to say this should be a creation time property only.
> > > 
> > > I still think userspace should be able to select it so it can minimize
> > > the number of HWPTs required.
> > > 
> > > > But the user shouldn't assume such explicit consistency since it's not
> > > > defined in our uAPI. All we defined is that the attaching may
> > > > fail due to incompatibility for whatever reason then the user can
> > > > always try creating a new hwpt for the to-be-attached device. From
> > > > this regard I don't see providing consistency of result is
> > > > necessary. 😊
> > > 
> > > Anyhow, OK, lets add a comment summarizing your points and remove the
> > > cc upgrade at attach time (sorry Nicolin/Yi!)
> > 
> > Ack. I will send a small removal series. I assume it should CC
> > stable tree also? 
> 
> No, it seems more like tidying that fixing a functional issue, do I
> misunderstand?

Hmm. Maybe the misunderstanding is mine -- Kevin was asking if
it was already a bug and you answered yes:
https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/

If this shouldn't be a bug fix, I could just merge them into a
single tidying patch and add the comments you suggested below.

> > And where should we add this comment? Kdoc of
> > the alloc uAPI?
> 
> Maybe right in front of the only enforce_cc op callback?

OK. How about this? Might be a bit verbose though:
	/*
	 * enforce_cache_coherenc must be determined during the HWPT allocation.
	 * Note that a HWPT (non-CC) created for a device (non-CC) can be later
	 * reused by another device (either non-CC or CC). However, A HWPT (CC)
	 * created for a device (CC) cannot be reused by another device (non-CC)
	 * but only devices (CC). Instead user space in this case would need to
	 * allocate a separate HWPT (non-CC).
	 */

Thanks
Nic
  
Tian, Kevin Oct. 23, 2023, 2:53 a.m. UTC | #18
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, October 23, 2023 8:18 AM
> 
> On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote:
> > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:
> > > >
> > > > > But the user shouldn't assume such explicit consistency since it's not
> > > > > defined in our uAPI. All we defined is that the attaching may
> > > > > fail due to incompatibility for whatever reason then the user can
> > > > > always try creating a new hwpt for the to-be-attached device. From
> > > > > this regard I don't see providing consistency of result is
> > > > > necessary. 😊
> > > >
> > > > Anyhow, OK, lets add a comment summarizing your points and remove
> the
> > > > cc upgrade at attach time (sorry Nicolin/Yi!)
> > >
> > > Ack. I will send a small removal series. I assume it should CC
> > > stable tree also?
> >
> > No, it seems more like tidying that fixing a functional issue, do I
> > misunderstand?
> 
> Hmm. Maybe the misunderstanding is mine -- Kevin was asking if
> it was already a bug and you answered yes:
> https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/
> 

currently intel-iommu driver already rejects 1) enforcing CC on
a domain which is already attached to non-CC device and
2) attaching a non-CC device to a domain which has enforce_cc.

so there is no explorable bug to fix in stable tree.

Just logically intel-iommu driver doesn't support enforcing Cc
on a domain which is capable of enforce-cc and already has
valid mappings. Then it should add proper check to prevent
it from being attempted by future usages.
  
Jason Gunthorpe Oct. 23, 2023, 1:59 p.m. UTC | #19
On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote:

> > > And where should we add this comment? Kdoc of
> > > the alloc uAPI?
> > 
> > Maybe right in front of the only enforce_cc op callback?
> 
> OK. How about this? Might be a bit verbose though:
> 	/*
> 	 * enforce_cache_coherenc must be determined during the HWPT allocation.
> 	 * Note that a HWPT (non-CC) created for a device (non-CC) can be later
> 	 * reused by another device (either non-CC or CC). However, A HWPT (CC)
> 	 * created for a device (CC) cannot be reused by another device (non-CC)
> 	 * but only devices (CC). Instead user space in this case would need to
> 	 * allocate a separate HWPT (non-CC).
> 	 */

Ok, fix the spello in enforce_cache_coherenc

Jason
  
Nicolin Chen Oct. 23, 2023, 6:42 p.m. UTC | #20
On Mon, Oct 23, 2023 at 02:53:20AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Monday, October 23, 2023 8:18 AM
> >
> > On Sat, Oct 21, 2023 at 01:38:04PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 20, 2023 at 11:59:13AM -0700, Nicolin Chen wrote:
> > > > On Fri, Oct 20, 2023 at 10:55:01AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Oct 20, 2023 at 02:43:58AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > But the user shouldn't assume such explicit consistency since it's not
> > > > > > defined in our uAPI. All we defined is that the attaching may
> > > > > > fail due to incompatibility for whatever reason then the user can
> > > > > > always try creating a new hwpt for the to-be-attached device. From
> > > > > > this regard I don't see providing consistency of result is
> > > > > > necessary. 😊
> > > > >
> > > > > Anyhow, OK, lets add a comment summarizing your points and remove
> > the
> > > > > cc upgrade at attach time (sorry Nicolin/Yi!)
> > > >
> > > > Ack. I will send a small removal series. I assume it should CC
> > > > stable tree also?
> > >
> > > No, it seems more like tidying that fixing a functional issue, do I
> > > misunderstand?
> >
> > Hmm. Maybe the misunderstanding is mine -- Kevin was asking if
> > it was already a bug and you answered yes:
> > https://lore.kernel.org/linux-iommu/20231016115736.GP3952@nvidia.com/
> >
> 
> currently intel-iommu driver already rejects 1) enforcing CC on
> a domain which is already attached to non-CC device and
> 2) attaching a non-CC device to a domain which has enforce_cc.
> 
> so there is no explorable bug to fix in stable tree.

I see. Thanks!
  
Nicolin Chen Oct. 23, 2023, 6:49 p.m. UTC | #21
On Mon, Oct 23, 2023 at 10:59:35AM -0300, Jason Gunthorpe wrote:
> On Sun, Oct 22, 2023 at 05:18:03PM -0700, Nicolin Chen wrote:
> 
> > > > And where should we add this comment? Kdoc of
> > > > the alloc uAPI?
> > > 
> > > Maybe right in front of the only enforce_cc op callback?
> > 
> > OK. How about this? Might be a bit verbose though:
> > 	/*
> > 	 * enforce_cache_coherenc must be determined during the HWPT allocation.
> > 	 * Note that a HWPT (non-CC) created for a device (non-CC) can be later
> > 	 * reused by another device (either non-CC or CC). However, A HWPT (CC)
> > 	 * created for a device (CC) cannot be reused by another device (non-CC)
> > 	 * but only devices (CC). Instead user space in this case would need to
> > 	 * allocate a separate HWPT (non-CC).
> > 	 */
> 
> Ok, fix the spello in enforce_cache_coherenc

Oops.

I also found the existing piece sorta says a similar thing:
         * Set the coherency mode before we do iopt_table_add_domain() as some
         * iommus have a per-PTE bit that controls it and need to decide before
         * doing any maps. 

So, did this and sending v3:
-        * enforce_cache_coherenc must be determined during the HWPT allocation.
+        * The cache coherency mode must be configured here and unchanged later.
  

Patch

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index eb120f70a3e3..104dd061a2a3 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -305,12 +305,16 @@  static int iommufd_group_setup_msi(struct iommufd_group *igroup,
 	 * domain after request_irq(). If it is not done interrupts will not
 	 * work on this domain.
 	 *
+	 * Note: always set up a msi_cookie on a kernel-manage hw_pagetable.
+	 *
 	 * FIXME: This is conceptually broken for iommufd since we want to allow
 	 * userspace to change the domains, eg switch from an identity IOAS to a
 	 * DMA IOAS. There is currently no way to create a MSI window that
 	 * matches what the IRQ layer actually expects in a newly created
 	 * domain.
 	 */
+	if (hwpt->user_managed)
+		hwpt = hwpt->parent;
 	if (sw_msi_start != PHYS_ADDR_MAX && !hwpt->msi_cookie) {
 		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
 		if (rc)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index dc3e11a23acf..90fd65859e28 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -152,6 +152,10 @@  iommufd_user_managed_hwpt_alloc(struct iommufd_ctx *ictx,
 
 int iommufd_hw_pagetable_enforce_cc(struct iommufd_hw_pagetable *hwpt)
 {
+	/* Always enforce cache coherency on a kernel-managed hw_pagetable */
+	if (hwpt->user_managed)
+		hwpt = hwpt->parent;
+
 	if (hwpt->enforce_cache_coherency)
 		return 0;