[v2,02/10] iommu: Introduce a new iommu_group_replace_domain() API

Message ID fa81379dca611c1d9f50f9d8cd2bca0d4ec7f965.1675802050.git.nicolinc@nvidia.com
State New
Headers
Series Add IO page table replacement support |

Commit Message

Nicolin Chen Feb. 7, 2023, 9:17 p.m. UTC
  qemu has a need to replace the translations associated with a domain
when the guest does large-scale operations like switching between an
IDENTITY domain and, say, dma-iommu.c.

Currently, it does this by replacing all the mappings in a single
domain, but this is very inefficient and means that domains have to be
per-device rather than per-translation.

Provide a high-level API to allow replacements of one domain with
another. This is similar to a detach/attach cycle except it doesn't
force the group to go to the blocking domain in-between.

By removing this forced blocking domain the iommu driver has the
opportunity to implement an atomic replacement of the domains to the
greatest extent its hardware allows.

It could be possible to adderss this by simply removing the protection
from the iommu_attach_group(), but it is not so clear if that is safe
for the few users. Thus, add a new API to serve this new purpose.

Atomic replacement allows the qemu emulation of the viommu to be more
complete, as real hardware has this ability.

All drivers are already required to support changing between active
UNMANAGED domains when using their attach_dev ops.

This API is expected to be used by IOMMUFD, so add to the iommu-priv
header and mark it as IOMMUFD_INTERNAL.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu-priv.h |  4 ++++
 drivers/iommu/iommu.c      | 28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
  

Comments

Tian, Kevin Feb. 9, 2023, 2:55 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> +int iommu_group_replace_domain(struct iommu_group *group,
> +			       struct iommu_domain *new_domain)
> +{
> +	int ret;
> +
> +	if (!new_domain)
> +		return -EINVAL;

Is there value of allowing NULL new domain so this plays like
iommu_detach_group() then iommufd only needs call one
function in both attach/detach path?

> +
> +	mutex_lock(&group->mutex);
> +	ret = __iommu_group_set_domain(group, new_domain);
> +	if (ret)
> +		__iommu_group_set_domain(group, group->domain);
> +	mutex_unlock(&group->mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain,
> IOMMUFD_INTERNAL);
> +
>  static int iommu_group_do_set_platform_dma(struct device *dev, void
> *data)
>  {
>  	const struct iommu_ops *ops = dev_iommu_ops(dev);
> --
> 2.39.1
>
  
Jason Gunthorpe Feb. 9, 2023, 1:23 p.m. UTC | #2
On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> > 
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > +			       struct iommu_domain *new_domain)
> > +{
> > +	int ret;
> > +
> > +	if (!new_domain)
> > +		return -EINVAL;
> 
> Is there value of allowing NULL new domain so this plays like
> iommu_detach_group() then iommufd only needs call one
> function in both attach/detach path?

We've used NULL to mean the 'platform domain' in the iommu core code
in a few places, I'd prefer to avoid overloading NULL.

IMHO it doesn't help iommufd to substitute detach with replace.

Jason
  
Tian, Kevin Feb. 10, 2023, 1:34 a.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 9, 2023 9:23 PM
> 
> On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > +			       struct iommu_domain *new_domain)
> > > +{
> > > +	int ret;
> > > +
> > > +	if (!new_domain)
> > > +		return -EINVAL;
> >
> > Is there value of allowing NULL new domain so this plays like
> > iommu_detach_group() then iommufd only needs call one
> > function in both attach/detach path?
> 
> We've used NULL to mean the 'platform domain' in the iommu core code
> in a few places, I'd prefer to avoid overloading NULL.
> 
> IMHO it doesn't help iommufd to substitute detach with replace.
> 

OK
  
Alex Williamson Feb. 10, 2023, 11:51 p.m. UTC | #4
On Tue, 7 Feb 2023 13:17:54 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:

> qemu has a need to replace the translations associated with a domain
> when the guest does large-scale operations like switching between an
> IDENTITY domain and, say, dma-iommu.c.
> 
> Currently, it does this by replacing all the mappings in a single
> domain, but this is very inefficient and means that domains have to be
> per-device rather than per-translation.
> 
> Provide a high-level API to allow replacements of one domain with
> another. This is similar to a detach/attach cycle except it doesn't
> force the group to go to the blocking domain in-between.
> 
> By removing this forced blocking domain the iommu driver has the
> opportunity to implement an atomic replacement of the domains to the
> greatest extent its hardware allows.
> 
> It could be possible to adderss this by simply removing the protection
> from the iommu_attach_group(), but it is not so clear if that is safe
> for the few users. Thus, add a new API to serve this new purpose.
> 
> Atomic replacement allows the qemu emulation of the viommu to be more
> complete, as real hardware has this ability.

I was under the impression that we could not atomically switch a
device's domain relative to in-flight DMA.  IIRC, the discussion was
relative to VT-d, and I vaguely recall something about the domain
needing to be invalidated before it could be replaced.  Am I
mis-remembering or has this since been solved?  Adding Ashok, who might
have been involved in one of those conversations.

Or maybe atomic is the wrong word here since we expect no in-flight DMA
during the sort of mode transitions referred to here, and we're really
just trying to convey that we can do this via a single operation with
reduced latency?  Thanks,

Alex

> All drivers are already required to support changing between active
> UNMANAGED domains when using their attach_dev ops.
> 
> This API is expected to be used by IOMMUFD, so add to the iommu-priv
> header and mark it as IOMMUFD_INTERNAL.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/iommu-priv.h |  4 ++++
>  drivers/iommu/iommu.c      | 28 ++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
> index 9e1497027cff..b546795a7e49 100644
> --- a/drivers/iommu/iommu-priv.h
> +++ b/drivers/iommu/iommu-priv.h
> @@ -15,4 +15,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
>  	 */
>  	return dev->iommu->iommu_dev->ops;
>  }
> +
> +extern int iommu_group_replace_domain(struct iommu_group *group,
> +				      struct iommu_domain *new_domain);
> +
>  #endif /* __LINUX_IOMMU_PRIV_H */
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a18b7f1a4e6e..15e07d39cd8d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2151,6 +2151,34 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_group);
>  
> +/**
> + * iommu_group_replace_domain - replace the domain that a group is attached to
> + * @new_domain: new IOMMU domain to replace with
> + * @group: IOMMU group that will be attached to the new domain
> + *
> + * This API allows the group to switch domains without being forced to go to
> + * the blocking domain in-between.
> + *
> + * If the currently attached domain is a core domain (e.g. a default_domain),
> + * it will act just like the iommu_attach_group().
> + */
> +int iommu_group_replace_domain(struct iommu_group *group,
> +			       struct iommu_domain *new_domain)
> +{
> +	int ret;
> +
> +	if (!new_domain)
> +		return -EINVAL;
> +
> +	mutex_lock(&group->mutex);
> +	ret = __iommu_group_set_domain(group, new_domain);
> +	if (ret)
> +		__iommu_group_set_domain(group, group->domain);
> +	mutex_unlock(&group->mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
> +
>  static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
>  {
>  	const struct iommu_ops *ops = dev_iommu_ops(dev);
  
Jason Gunthorpe Feb. 11, 2023, 12:44 a.m. UTC | #5
On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
> On Tue, 7 Feb 2023 13:17:54 -0800
> Nicolin Chen <nicolinc@nvidia.com> wrote:
> 
> > qemu has a need to replace the translations associated with a domain
> > when the guest does large-scale operations like switching between an
> > IDENTITY domain and, say, dma-iommu.c.
> > 
> > Currently, it does this by replacing all the mappings in a single
> > domain, but this is very inefficient and means that domains have to be
> > per-device rather than per-translation.
> > 
> > Provide a high-level API to allow replacements of one domain with
> > another. This is similar to a detach/attach cycle except it doesn't
> > force the group to go to the blocking domain in-between.
> > 
> > By removing this forced blocking domain the iommu driver has the
> > opportunity to implement an atomic replacement of the domains to the
> > greatest extent its hardware allows.
> > 
> > It could be possible to adderss this by simply removing the protection
> > from the iommu_attach_group(), but it is not so clear if that is safe
> > for the few users. Thus, add a new API to serve this new purpose.
> > 
> > Atomic replacement allows the qemu emulation of the viommu to be more
> > complete, as real hardware has this ability.
> 
> I was under the impression that we could not atomically switch a
> device's domain relative to in-flight DMA.  

Certainly all the HW can be proper atomic but not necessarily easily -
the usual issue is a SW complication to manage the software controlled
cache tags in a way that doesn't corrupt the cache.

This is because the cache tag and the io page table top are in
different 64 bit words so atomic writes don't cover both, and thus the
IOMMU HW could tear the two stores and mismatch the cache tag to the
table top. This would corrupt the cache.

The easiest way to avoid this is for SW to use the same DID for the
new and old tables. This is possible if this translation entry is the
only user of the DID. A more complex way would use a temporary DID
that can be safely corrupted. But realistically I'd expect VT-d
drivers to simply make the PASID invalid for the duration of the
update.

However something like AMD has a single cache tag for every entry in
the PASID table so you could do an atomic replace trivially. Just
update the PASID and invalidate the caches.

ARM has a flexible PASID table and atomic replace would be part of
resizing it. eg you can atomically update the top of the PASID table
with a single 64 bit store.

So replace lets the drivers implement those special behaviors if it
makes sense for them.

> Or maybe atomic is the wrong word here since we expect no in-flight DMA
> during the sort of mode transitions referred to here, and we're really
> just trying to convey that we can do this via a single operation with
> reduced latency?  Thanks,

atomic means DMA will either translate with the old domain or the new
domain but never a blocking domain. Keep in mind that with nesting
"domain" can mean a full PASID table in guest memory.

I should reiterate that replace is not an API that is required to be
atomic.

Jason
  
Tian, Kevin Feb. 13, 2023, 2:24 a.m. UTC | #6
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, February 11, 2023 8:45 AM
> 
> On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
> > On Tue, 7 Feb 2023 13:17:54 -0800
> > Nicolin Chen <nicolinc@nvidia.com> wrote:
> >
> > > qemu has a need to replace the translations associated with a domain
> > > when the guest does large-scale operations like switching between an
> > > IDENTITY domain and, say, dma-iommu.c.
> > >
> > > Currently, it does this by replacing all the mappings in a single
> > > domain, but this is very inefficient and means that domains have to be
> > > per-device rather than per-translation.
> > >
> > > Provide a high-level API to allow replacements of one domain with
> > > another. This is similar to a detach/attach cycle except it doesn't
> > > force the group to go to the blocking domain in-between.
> > >
> > > By removing this forced blocking domain the iommu driver has the
> > > opportunity to implement an atomic replacement of the domains to the
> > > greatest extent its hardware allows.
> > >
> > > It could be possible to adderss this by simply removing the protection
> > > from the iommu_attach_group(), but it is not so clear if that is safe
> > > for the few users. Thus, add a new API to serve this new purpose.
> > >
> > > Atomic replacement allows the qemu emulation of the viommu to be
> more
> > > complete, as real hardware has this ability.
> >
> > I was under the impression that we could not atomically switch a
> > device's domain relative to in-flight DMA.

it's possible as long as the mappings for in-flight DMA don't change
in the transition.

> 
> Certainly all the HW can be proper atomic but not necessarily easily -
> the usual issue is a SW complication to manage the software controlled
> cache tags in a way that doesn't corrupt the cache.
> 
> This is because the cache tag and the io page table top are in
> different 64 bit words so atomic writes don't cover both, and thus the
> IOMMU HW could tear the two stores and mismatch the cache tag to the
> table top. This would corrupt the cache.

VT-d spec recommends using 128bit cmpxchg instruction to update
page table pointer and DID together.

> 
> The easiest way to avoid this is for SW to use the same DID for the
> new and old tables. This is possible if this translation entry is the
> only user of the DID. A more complex way would use a temporary DID
> that can be safely corrupted. But realistically I'd expect VT-d
> drivers to simply make the PASID invalid for the duration of the
> update.
> 
> However something like AMD has a single cache tag for every entry in
> the PASID table so you could do an atomic replace trivially. Just
> update the PASID and invalidate the caches.
> 
> ARM has a flexible PASID table and atomic replace would be part of
> resizing it. eg you can atomically update the top of the PASID table
> with a single 64 bit store.
> 
> So replace lets the drivers implement those special behaviors if it
> makes sense for them.
> 
> > Or maybe atomic is the wrong word here since we expect no in-flight DMA
> > during the sort of mode transitions referred to here, and we're really
> > just trying to convey that we can do this via a single operation with
> > reduced latency?  Thanks,
> 
> atomic means DMA will either translate with the old domain or the new
> domain but never a blocking domain. Keep in mind that with nesting
> "domain" can mean a full PASID table in guest memory.
> 
> I should reiterate that replace is not an API that is required to be
> atomic.
> 

yes. Just as explained in the commit message:

"
  By removing this forced blocking domain the iommu driver has the
  opportunity to implement an atomic replacement of the domains to the
  greatest extent its hardware allows.
"

API level replace only implies removing transition to/from blocking
domain. and then it's driver's call whether it wants to take this
chance to implement a true 'atomic' behavior.
  
Baolu Lu Feb. 13, 2023, 8:34 a.m. UTC | #7
On 2023/2/13 10:24, Tian, Kevin wrote:
>> From: Jason Gunthorpe<jgg@nvidia.com>
>> Sent: Saturday, February 11, 2023 8:45 AM
>>
>> On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
>>> On Tue, 7 Feb 2023 13:17:54 -0800
>>> Nicolin Chen<nicolinc@nvidia.com>  wrote:
>>>
>>>> qemu has a need to replace the translations associated with a domain
>>>> when the guest does large-scale operations like switching between an
>>>> IDENTITY domain and, say, dma-iommu.c.
>>>>
>>>> Currently, it does this by replacing all the mappings in a single
>>>> domain, but this is very inefficient and means that domains have to be
>>>> per-device rather than per-translation.
>>>>
>>>> Provide a high-level API to allow replacements of one domain with
>>>> another. This is similar to a detach/attach cycle except it doesn't
>>>> force the group to go to the blocking domain in-between.
>>>>
>>>> By removing this forced blocking domain the iommu driver has the
>>>> opportunity to implement an atomic replacement of the domains to the
>>>> greatest extent its hardware allows.
>>>>
>>>> It could be possible to adderss this by simply removing the protection
>>>> from the iommu_attach_group(), but it is not so clear if that is safe
>>>> for the few users. Thus, add a new API to serve this new purpose.
>>>>
>>>> Atomic replacement allows the qemu emulation of the viommu to be
>> more
>>>> complete, as real hardware has this ability.
>>> I was under the impression that we could not atomically switch a
>>> device's domain relative to in-flight DMA.
> it's possible as long as the mappings for in-flight DMA don't change
> in the transition.
> 

It also requires the mappings in old and new domains are identical. In
another word, any IOVA should be translated to a same result no matter
through the old domain, the new domain, or hardware caches.

A similar replacement has been implemented in the code. For example,
the Intel IOMMU driver dynamically transforms a large range (2M or 1G)
mapping from discrete 4k pages to a super pages to improve the paging
cache efficiency.

Best regards,
baolu
  
Jason Gunthorpe Feb. 13, 2023, 2:45 p.m. UTC | #8
On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:

> > This is because the cache tag and the io page table top are in
> > different 64 bit words so atomic writes don't cover both, and thus the
> > IOMMU HW could tear the two stores and mismatch the cache tag to the
> > table top. This would corrupt the cache.
> 
> VT-d spec recommends using 128bit cmpxchg instruction to update
> page table pointer and DID together.

Oh really? Such a thing exists? That neat!

Jason
  
Tian, Kevin Feb. 14, 2023, 3:29 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, February 13, 2023 10:46 PM
> 
> On Mon, Feb 13, 2023 at 02:24:40AM +0000, Tian, Kevin wrote:
> 
> > > This is because the cache tag and the io page table top are in
> > > different 64 bit words so atomic writes don't cover both, and thus the
> > > IOMMU HW could tear the two stores and mismatch the cache tag to the
> > > table top. This would corrupt the cache.
> >
> > VT-d spec recommends using 128bit cmpxchg instruction to update
> > page table pointer and DID together.
> 
> Oh really? Such a thing exists? That neat!
> 

yes. see cmpxchg_double().
  
Tian, Kevin Feb. 15, 2023, 6:10 a.m. UTC | #10
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> +int iommu_group_replace_domain(struct iommu_group *group,
> +			       struct iommu_domain *new_domain)
> +{
> +	int ret;
> +
> +	if (!new_domain)
> +		return -EINVAL;
> +
> +	mutex_lock(&group->mutex);
> +	ret = __iommu_group_set_domain(group, new_domain);
> +	if (ret)
> +		__iommu_group_set_domain(group, group->domain);

Just realize the error unwind is a nop given below:

__iommu_group_set_domain()
{
	if (group->domain == new_domain)
		return 0;

	...

There was an attempt [1] to fix error unwind in iommu_attach_group(), by
temporarily set group->domain to NULL before calling set_domain().

Jason, I wonder why this recovering cannot be done in
__iommu_group_set_domain() directly, e.g.:

	ret = __iommu_group_for_each_dev(group, new_domain,
					  iommu_group_do_attach_device);
	if (ret) {
		__iommu_group_for_each_dev(group, group->domain,
					  iommu_group_do_attach_device);
		return ret;
	}
	group->domain = new_domain;

Thanks,
Kevin

[1] https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/
  
Jason Gunthorpe Feb. 15, 2023, 12:52 p.m. UTC | #11
On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> > 
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > +			       struct iommu_domain *new_domain)
> > +{
> > +	int ret;
> > +
> > +	if (!new_domain)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&group->mutex);
> > +	ret = __iommu_group_set_domain(group, new_domain);
> > +	if (ret)
> > +		__iommu_group_set_domain(group, group->domain);
> 
> Just realize the error unwind is a nop given below:
> 
> __iommu_group_set_domain()
> {
> 	if (group->domain == new_domain)
> 		return 0;
> 
> 	...
> 
> There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> temporarily set group->domain to NULL before calling set_domain().
> 
> Jason, I wonder why this recovering cannot be done in
> __iommu_group_set_domain() directly, e.g.:
> 
> 	ret = __iommu_group_for_each_dev(group, new_domain,
> 					  iommu_group_do_attach_device);
> 	if (ret) {
> 		__iommu_group_for_each_dev(group, group->domain,
> 					  iommu_group_do_attach_device);
> 		return ret;
> 	}
> 	group->domain = new_domain;

We talked about this already, some times this is not the correct
recovery case, eg if we are going to a blocking domain we need to drop
all references to the prior domain, not put them back.

Failures are WARN_ON events not error recovery.

Jason
  
Tian, Kevin Feb. 22, 2023, 2:11 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 15, 2023 8:53 PM
> 
> On Wed, Feb 15, 2023 at 06:10:47AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > +			       struct iommu_domain *new_domain)
> > > +{
> > > +	int ret;
> > > +
> > > +	if (!new_domain)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&group->mutex);
> > > +	ret = __iommu_group_set_domain(group, new_domain);
> > > +	if (ret)
> > > +		__iommu_group_set_domain(group, group->domain);
> >
> > Just realize the error unwind is a nop given below:
> >
> > __iommu_group_set_domain()
> > {
> > 	if (group->domain == new_domain)
> > 		return 0;
> >
> > 	...
> >
> > There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> > temporarily set group->domain to NULL before calling set_domain().
> >
> > Jason, I wonder why this recovering cannot be done in
> > __iommu_group_set_domain() directly, e.g.:
> >
> > 	ret = __iommu_group_for_each_dev(group, new_domain,
> > 					  iommu_group_do_attach_device);
> > 	if (ret) {
> > 		__iommu_group_for_each_dev(group, group->domain,
> > 					  iommu_group_do_attach_device);
> > 		return ret;
> > 	}
> > 	group->domain = new_domain;
> 
> We talked about this already, some times this is not the correct
> recovery case, eg if we are going to a blocking domain we need to drop
> all references to the prior domain, not put them back.
> 
> Failures are WARN_ON events not error recovery.
> 

OK, I remember that. Then here looks we also need temporarily
set group->domain to NULL before calling set_domain() to recover,
as [1] does.

[1] https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/
  
Jason Gunthorpe Feb. 24, 2023, 12:57 a.m. UTC | #13
On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:

> > > There was an attempt [1] to fix error unwind in iommu_attach_group(), by
> > > temporarily set group->domain to NULL before calling set_domain().
> > >
> > > Jason, I wonder why this recovering cannot be done in
> > > __iommu_group_set_domain() directly, e.g.:
> > >
> > > 	ret = __iommu_group_for_each_dev(group, new_domain,
> > > 					  iommu_group_do_attach_device);
> > > 	if (ret) {
> > > 		__iommu_group_for_each_dev(group, group->domain,
> > > 					  iommu_group_do_attach_device);
> > > 		return ret;
> > > 	}
> > > 	group->domain = new_domain;
> > 
> > We talked about this already, some times this is not the correct
> > recovery case, eg if we are going to a blocking domain we need to drop
> > all references to the prior domain, not put them back.
> > 
> > Failures are WARN_ON events not error recovery.
> > 
> 
> OK, I remember that. Then here looks we also need temporarily
> set group->domain to NULL before calling set_domain() to recover,
> as [1] does.

Sigh, this is too much.

I made a series to clean up all the domain attach logic so the error
handling is all in one place and all the same.

What do you think?

https://github.com/jgunthorpe/linux/commits/iommufd_hwpt

Jason
  
Tian, Kevin Feb. 24, 2023, 8:07 a.m. UTC | #14
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 24, 2023 8:58 AM
> 
> On Wed, Feb 22, 2023 at 02:11:39AM +0000, Tian, Kevin wrote:
> 
> > > > There was an attempt [1] to fix error unwind in iommu_attach_group(),
> by
> > > > temporarily set group->domain to NULL before calling set_domain().
> > > >
> > > > Jason, I wonder why this recovering cannot be done in
> > > > __iommu_group_set_domain() directly, e.g.:
> > > >
> > > > 	ret = __iommu_group_for_each_dev(group, new_domain,
> > > > 					  iommu_group_do_attach_device);
> > > > 	if (ret) {
> > > > 		__iommu_group_for_each_dev(group, group->domain,
> > > > 					  iommu_group_do_attach_device);
> > > > 		return ret;
> > > > 	}
> > > > 	group->domain = new_domain;
> > >
> > > We talked about this already, some times this is not the correct
> > > recovery case, eg if we are going to a blocking domain we need to drop
> > > all references to the prior domain, not put them back.
> > >
> > > Failures are WARN_ON events not error recovery.
> > >
> >
> > OK, I remember that. Then here looks we also need temporarily
> > set group->domain to NULL before calling set_domain() to recover,
> > as [1] does.
> 
> Sigh, this is too much.
> 
> I made a series to clean up all the domain attach logic so the error
> handling is all in one place and all the same.
> 
> What do you think?
> 
> https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
> 

Yeah, that sounds a right cleanup at a glance.
  

Patch

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 9e1497027cff..b546795a7e49 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -15,4 +15,8 @@  static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	 */
 	return dev->iommu->iommu_dev->ops;
 }
+
+extern int iommu_group_replace_domain(struct iommu_group *group,
+				      struct iommu_domain *new_domain);
+
 #endif /* __LINUX_IOMMU_PRIV_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a18b7f1a4e6e..15e07d39cd8d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2151,6 +2151,34 @@  int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_group);
 
+/**
+ * iommu_group_replace_domain - replace the domain that a group is attached to
+ * @new_domain: new IOMMU domain to replace with
+ * @group: IOMMU group that will be attached to the new domain
+ *
+ * This API allows the group to switch domains without being forced to go to
+ * the blocking domain in-between.
+ *
+ * If the currently attached domain is a core domain (e.g. a default_domain),
+ * it will act just like the iommu_attach_group().
+ */
+int iommu_group_replace_domain(struct iommu_group *group,
+			       struct iommu_domain *new_domain)
+{
+	int ret;
+
+	if (!new_domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain, IOMMUFD_INTERNAL);
+
 static int iommu_group_do_set_platform_dma(struct device *dev, void *data)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);