[v6,07/10] iommufd: Add a nested HW pagetable object

Message ID 20231024150609.46884-8-yi.l.liu@intel.com
State New
Headers
Series iommufd: Add nesting infrastructure (part 1/2) |

Commit Message

Yi Liu Oct. 24, 2023, 3:06 p.m. UTC
  From: Nicolin Chen <nicolinc@nvidia.com>

IOMMU_HWPT_ALLOC already supports iommu_domain allocation for usersapce.
But it can only allocate a hw_pagetable that associates to a given IOAS,
i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type.

IOMMU drivers can now support user-managed hw_pagetables, for two-stage
translation use cases that require user data input from the user space.

Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it
with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper.
Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be
handled in the callers, for example iommufd_hw_pagetable_enforce_rr().

Screen the inputs including the parent PAGING-type hw_pagetable that has
a need of a new nest_parent flag in the iommufd_hwpt_paging structure.

Extend the IOMMU_HWPT_ALLOC ioctl to accept an IOMMU driver specific data
input. Also, update the @pt_id to accept hwpt_id too besides an ioas_id.
Then, use them to allocate a hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type
using the iommufd_hw_pagetable_alloc_nested() allocator.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c          |   1 +
 drivers/iommu/iommufd/hw_pagetable.c    | 105 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  29 +++++--
 drivers/iommu/iommufd/main.c            |   4 +
 include/uapi/linux/iommufd.h            |  23 +++++-
 5 files changed, 154 insertions(+), 8 deletions(-)
  

Comments

Jason Gunthorpe Oct. 24, 2023, 5:13 p.m. UTC | #1
On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> +static struct iommufd_hwpt_nested *
> +iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> +			  struct iommufd_hwpt_paging *parent,
> +			  struct iommufd_device *idev, u32 flags,
> +			  const struct iommu_user_data *user_data)
> +{
> +	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
> +	struct iommufd_hwpt_nested *hwpt_nested;
> +	struct iommufd_hw_pagetable *hwpt;
> +	int rc;
> +
> +	if (flags != 0)
> +		return ERR_PTR(-EOPNOTSUPP);
> +	if (!user_data)
> +		return ERR_PTR(-EINVAL);

user_data is never NULL now, I removed this

> +	if (user_data->type == IOMMU_HWPT_DATA_NONE)
> +		return ERR_PTR(-EINVAL);
> +	if (parent->auto_domain)
> +		return ERR_PTR(-EINVAL);
> +	if (!parent->nest_parent)
> +		return ERR_PTR(-EINVAL);

And put these all together

	if (flags || user_data->type == IOMMU_HWPT_DATA_NONE ||
	    !ops->domain_alloc_user)
		return ERR_PTR(-EOPNOTSUPP);
	if (parent->auto_domain || !parent->nest_parent)
		return ERR_PTR(-EINVAL);


> +
> +	if (!ops->domain_alloc_user)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	hwpt_nested = __iommufd_object_alloc(ictx, hwpt_nested,
> +					     IOMMUFD_OBJ_HWPT_NESTED,
> +					     common.obj);
> +	if (IS_ERR(hwpt_nested))
> +		return ERR_CAST(hwpt_nested);
> +	hwpt = &hwpt_nested->common;
> +
> +	refcount_inc(&parent->common.obj.users);
> +	hwpt_nested->parent = parent;
> +
> +	hwpt->domain = ops->domain_alloc_user(idev->dev, 0,

And we may as well pass flags here even though we know it is 0 today.

Jason
  
Jason Gunthorpe Oct. 24, 2023, 5:18 p.m. UTC | #2
On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>  	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
>  		struct iommufd_hwpt_paging *hwpt_paging;
>  
> +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> +			rc = -EINVAL;
> +			goto out_put_pt;
> +		}
>  		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
>  		mutex_lock(&ioas->mutex);
>  		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,

?? What is this?

Ah something went wrong earlier in "iommu: Pass in parent domain with
user_data to domain_alloc_user op"

Once we added the user_data we should flow it through to the op
always.

Jason
  
Nicolin Chen Oct. 24, 2023, 5:28 p.m. UTC | #3
On Tue, Oct 24, 2023 at 02:18:10PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> > @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> >  	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
> >  		struct iommufd_hwpt_paging *hwpt_paging;
> >  
> > +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> >  		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> >  		mutex_lock(&ioas->mutex);
> >  		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
> 
> ?? What is this?
> 
> Ah something went wrong earlier in "iommu: Pass in parent domain with
> user_data to domain_alloc_user op"
> 
> Once we added the user_data we should flow it through to the op
> always.

Hmm, iommufd_hwpt_paging_alloc doesn't take (or need) user_data,
but we could pass in a dummy one if that looks better?

Thanks
Nicolin
  
Jason Gunthorpe Oct. 24, 2023, 5:30 p.m. UTC | #4
On Tue, Oct 24, 2023 at 02:18:10PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> > @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> >  	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
> >  		struct iommufd_hwpt_paging *hwpt_paging;
> >  
> > +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> > +			rc = -EINVAL;
> > +			goto out_put_pt;
> > +		}
> >  		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> >  		mutex_lock(&ioas->mutex);
> >  		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
> 
> ?? What is this?
> 
> Ah something went wrong earlier in "iommu: Pass in parent domain with
> user_data to domain_alloc_user op"

Bah, I got confused because that had half the uapi, so in this pathc
 
> Once we added the user_data we should flow it through to the op
> always.

Like this:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 92859b907bb93c..a3382811af8a81 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -586,8 +586,8 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
 		goto out_unlock;
 	}
 
-	hwpt_paging = iommufd_hwpt_paging_alloc(idev->ictx, ioas, idev,
-						 0, immediate_attach);
+	hwpt_paging = iommufd_hwpt_paging_alloc(idev->ictx, ioas, idev, 0,
+						immediate_attach, NULL);
 	if (IS_ERR(hwpt_paging)) {
 		destroy_hwpt = ERR_CAST(hwpt_paging);
 		goto out_unlock;
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index cfd85df693d7b2..324a6d73f032ee 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -96,7 +96,8 @@ iommufd_hwpt_paging_enforce_cc(struct iommufd_hwpt_paging *hwpt_paging)
 struct iommufd_hwpt_paging *
 iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			  struct iommufd_device *idev, u32 flags,
-			  bool immediate_attach)
+			  bool immediate_attach,
+			  const struct iommu_user_data *user_data)
 {
 	const u32 valid_flags = IOMMU_HWPT_ALLOC_NEST_PARENT |
 				IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
@@ -107,7 +108,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 
 	lockdep_assert_held(&ioas->mutex);
 
-	if (flags && !ops->domain_alloc_user)
+	if ((flags || user_data) && !ops->domain_alloc_user)
 		return ERR_PTR(-EOPNOTSUPP);
 	if (flags & ~valid_flags)
 		return ERR_PTR(-EOPNOTSUPP);
@@ -127,7 +128,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 
 	if (ops->domain_alloc_user) {
 		hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
-						      NULL, NULL);
+						      NULL, user_data);
 		if (IS_ERR(hwpt->domain)) {
 			rc = PTR_ERR(hwpt->domain);
 			hwpt->domain = NULL;
@@ -210,8 +211,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
 	struct iommufd_hw_pagetable *hwpt;
 	int rc;
 
-	if (flags || user_data->type == IOMMU_HWPT_DATA_NONE ||
-	    !ops->domain_alloc_user)
+	if (flags || !user_data->len || !ops->domain_alloc_user)
 		return ERR_PTR(-EOPNOTSUPP);
 	if (parent->auto_domain || !parent->nest_parent)
 		return ERR_PTR(-EINVAL);
@@ -249,6 +249,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_hwpt_alloc *cmd = ucmd->cmd;
+	const struct iommu_user_data user_data = {
+		.type = cmd->data_type,
+		.uptr = u64_to_user_ptr(cmd->data_uptr),
+		.len = cmd->data_len,
+	};
 	struct iommufd_hw_pagetable *hwpt;
 	struct iommufd_ioas *ioas = NULL;
 	struct iommufd_object *pt_obj;
@@ -273,25 +278,17 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
 		struct iommufd_hwpt_paging *hwpt_paging;
 
-		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
-			rc = -EINVAL;
-			goto out_put_pt;
-		}
 		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
 		mutex_lock(&ioas->mutex);
-		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
-							cmd->flags, false);
+		hwpt_paging = iommufd_hwpt_paging_alloc(
+			ucmd->ictx, ioas, idev, cmd->flags, false,
+			user_data.len ? &user_data : NULL);
 		if (IS_ERR(hwpt_paging)) {
 			rc = PTR_ERR(hwpt_paging);
 			goto out_unlock;
 		}
 		hwpt = &hwpt_paging->common;
 	} else if (pt_obj->type == IOMMUFD_OBJ_HWPT_PAGING) {
-		const struct iommu_user_data user_data = {
-			.type = cmd->data_type,
-			.uptr = u64_to_user_ptr(cmd->data_uptr),
-			.len = cmd->data_len,
-		};
 		struct iommufd_hwpt_nested *hwpt_nested;
 		struct iommufd_hwpt_paging *parent;
 
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6fdad56893af4d..24e5a36fc875e0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -290,7 +290,8 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd);
 struct iommufd_hwpt_paging *
 iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			  struct iommufd_device *idev, u32 flags,
-			  bool immediate_attach);
+			  bool immediate_attach,
+			  const struct iommu_user_data *user_data);
 int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 				struct iommufd_device *idev);
 struct iommufd_hw_pagetable *
  
Jason Gunthorpe Oct. 24, 2023, 5:31 p.m. UTC | #5
On Tue, Oct 24, 2023 at 10:28:45AM -0700, Nicolin Chen wrote:
> On Tue, Oct 24, 2023 at 02:18:10PM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> > > @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> > >  	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
> > >  		struct iommufd_hwpt_paging *hwpt_paging;
> > >  
> > > +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> > > +			rc = -EINVAL;
> > > +			goto out_put_pt;
> > > +		}
> > >  		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > >  		mutex_lock(&ioas->mutex);
> > >  		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
> > 
> > ?? What is this?
> > 
> > Ah something went wrong earlier in "iommu: Pass in parent domain with
> > user_data to domain_alloc_user op"
> > 
> > Once we added the user_data we should flow it through to the op
> > always.
> 
> Hmm, iommufd_hwpt_paging_alloc doesn't take (or need) user_data,
> but we could pass in a dummy one if that looks better?

The point is for the user_data to always be available, the driver
needs to check it if it is passed.

This should all be plumbed to allow drivers to also customize their
paging domains too.

Jason
  
Jason Gunthorpe Oct. 24, 2023, 5:37 p.m. UTC | #6
On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
>  static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
>  					    struct iommufd_hw_pagetable *hwpt)
>  {
> -	struct iommufd_hwpt_paging *hwpt_paging = to_hwpt_paging(hwpt);
> +	if (WARN_ON(hwpt->obj.type != IOMMUFD_OBJ_HWPT_PAGING &&
> +		    hwpt->obj.type != IOMMUFD_OBJ_HWPT_NESTED))
> +		return;

This is redundant, we have a C type, no need to check the type field
like this

Jason
  
Nicolin Chen Oct. 24, 2023, 5:50 p.m. UTC | #7
On Tue, Oct 24, 2023 at 02:31:39PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 10:28:45AM -0700, Nicolin Chen wrote:
> > On Tue, Oct 24, 2023 at 02:18:10PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
> > > > @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> > > >  	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
> > > >  		struct iommufd_hwpt_paging *hwpt_paging;
> > > >  
> > > > +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> > > > +			rc = -EINVAL;
> > > > +			goto out_put_pt;
> > > > +		}
> > > >  		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
> > > >  		mutex_lock(&ioas->mutex);
> > > >  		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
> > > 
> > > ?? What is this?
> > > 
> > > Ah something went wrong earlier in "iommu: Pass in parent domain with
> > > user_data to domain_alloc_user op"
> > > 
> > > Once we added the user_data we should flow it through to the op
> > > always.
> > 
> > Hmm, iommufd_hwpt_paging_alloc doesn't take (or need) user_data,
> > but we could pass in a dummy one if that looks better?
> 
> The point is for the user_data to always be available, the driver
> needs to check it if it is passed.
> 
> This should all be plumbed to allow drivers to also customize their
> paging domains too.

We don't have a use case of customizing the paging domains.
And our selftest isn't covering this path. Nor the case is
supported by the uAPI:

458- * A kernel-managed HWPT will be created with the mappings from the given
459- * IOAS via the @pt_id. The @data_type for this allocation must be set to
460: * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
461- * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
462- *


Also, if we do passing in the data, we'd need to...

280-static struct iommu_domain *
281-mock_domain_alloc_user(struct device *dev, u32 flags,
282-		       struct iommu_domain *parent,
283:		       const struct iommu_user_data *user_data)
284-{
285-	struct mock_iommu_domain *mock_parent;
286-	struct iommu_hwpt_selftest user_cfg;
287-	int rc;
288-
289:	if (!user_data) {	/* must be mock_domain */

...change this to if (!parent)...

290-		struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
291-		bool has_dirty_flag = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
292-		bool no_dirty_ops = mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY;
293-
294-		if (parent)
295-			return ERR_PTR(-EINVAL);

...and drop this.

296-		if (has_dirty_flag && no_dirty_ops)
297-			return ERR_PTR(-EOPNOTSUPP);
298-		return __mock_domain_alloc_paging(IOMMU_DOMAIN_UNMANAGED,
299-						  has_dirty_flag);
300-	}
301-
302-	/* must be mock_domain_nested */
303:	if (user_data->type != IOMMU_HWPT_DATA_SELFTEST)
304-		return ERR_PTR(-EOPNOTSUPP);

Thanks
Nicolin
  
Jason Gunthorpe Oct. 24, 2023, 6 p.m. UTC | #8
On Tue, Oct 24, 2023 at 10:50:58AM -0700, Nicolin Chen wrote:

> > The point is for the user_data to always be available, the driver
> > needs to check it if it is passed.
> > 
> > This should all be plumbed to allow drivers to also customize their
> > paging domains too.
> 
> We don't have a use case of customizing the paging domains.
> And our selftest isn't covering this path. Nor the case is
> supported by the uAPI:

But this is the design, it is why everything is setup like this - we
didn't create a new op to allocate nesting domains, we made a flexible
user allocator.

> 458- * A kernel-managed HWPT will be created with the mappings from the given
> 459- * IOAS via the @pt_id. The @data_type for this allocation must be set to
> 460: * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
> 461- * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
> 462- *

Yes, that is the reality today. If someone comes to use the more
complete interface they need to fix that comment..

> Also, if we do passing in the data, we'd need to...
 
> 280-static struct iommu_domain *
> 281-mock_domain_alloc_user(struct device *dev, u32 flags,
> 282-		       struct iommu_domain *parent,
> 283:		       const struct iommu_user_data *user_data)
> 284-{
> 285-	struct mock_iommu_domain *mock_parent;
> 286-	struct iommu_hwpt_selftest user_cfg;
> 287-	int rc;
> 288-
> 289:	if (!user_data) {	/* must be mock_domain */
>
> ...change this to if (!parent)...

Yes, this logic is not ideal. The parent is the request for nesting,
not user_data. user_data is the generic creation parameters, which are
not supported outside nesting
 
Like this:

--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -286,14 +286,12 @@ mock_domain_alloc_user(struct device *dev, u32 flags,
        int rc;
 
        /* must be mock_domain */
-       if (!user_data) {
+       if (!parent) {
                struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
                bool has_dirty_flag = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
                bool no_dirty_ops = mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY;
 
-               if (parent)
-                       return ERR_PTR(-EINVAL);
-               if (has_dirty_flag && no_dirty_ops)
+               if (user_data || (has_dirty_flag && no_dirty_ops))
                        return ERR_PTR(-EOPNOTSUPP);
                return __mock_domain_alloc_paging(IOMMU_DOMAIN_UNMANAGED,
                                                  has_dirty_flag);

Thanks,
Jason
  
Nicolin Chen Oct. 24, 2023, 6:19 p.m. UTC | #9
On Tue, Oct 24, 2023 at 03:00:49PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 10:50:58AM -0700, Nicolin Chen wrote:
> 
> > > The point is for the user_data to always be available, the driver
> > > needs to check it if it is passed.
> > > 
> > > This should all be plumbed to allow drivers to also customize their
> > > paging domains too.
> > 
> > We don't have a use case of customizing the paging domains.
> > And our selftest isn't covering this path. Nor the case is
> > supported by the uAPI:
> 
> But this is the design, it is why everything is setup like this - we
> didn't create a new op to allocate nesting domains, we made a flexible
> user allocator.
> 
> > 458- * A kernel-managed HWPT will be created with the mappings from the given
> > 459- * IOAS via the @pt_id. The @data_type for this allocation must be set to
> > 460: * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
> > 461- * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
> > 462- *
> 
> Yes, that is the reality today. If someone comes to use the more
> complete interface they need to fix that comment..

Ack.

> > Also, if we do passing in the data, we'd need to...
>  
> > 280-static struct iommu_domain *
> > 281-mock_domain_alloc_user(struct device *dev, u32 flags,
> > 282-		       struct iommu_domain *parent,
> > 283:		       const struct iommu_user_data *user_data)
> > 284-{
> > 285-	struct mock_iommu_domain *mock_parent;
> > 286-	struct iommu_hwpt_selftest user_cfg;
> > 287-	int rc;
> > 288-
> > 289:	if (!user_data) {	/* must be mock_domain */
> >
> > ...change this to if (!parent)...
> 
> Yes, this logic is not ideal. The parent is the request for nesting,
> not user_data. user_data is the generic creation parameters, which are
> not supported outside nesting
>  
> Like this:
> 
> --- a/drivers/iommu/iommufd/selftest.c
> +++ b/drivers/iommu/iommufd/selftest.c
> @@ -286,14 +286,12 @@ mock_domain_alloc_user(struct device *dev, u32 flags,
>         int rc;
>  
>         /* must be mock_domain */
> -       if (!user_data) {
> +       if (!parent) {
>                 struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
>                 bool has_dirty_flag = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>                 bool no_dirty_ops = mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY;
>  
> -               if (parent)
> -                       return ERR_PTR(-EINVAL);
> -               if (has_dirty_flag && no_dirty_ops)
> +               if (user_data || (has_dirty_flag && no_dirty_ops))
>                         return ERR_PTR(-EOPNOTSUPP);
>                 return __mock_domain_alloc_paging(IOMMU_DOMAIN_UNMANAGED,
>                                                   has_dirty_flag);

Yea.. Then the vt-d driver needs a similar change too (@Yi) as I
found it almost doing the same:
https://lore.kernel.org/linux-iommu/20231024151412.50046-8-yi.l.liu@intel.com/

Thanks
Nic
  
Yi Liu Oct. 25, 2023, 4:05 a.m. UTC | #10
On 2023/10/25 02:19, Nicolin Chen wrote:
> On Tue, Oct 24, 2023 at 03:00:49PM -0300, Jason Gunthorpe wrote:
>> On Tue, Oct 24, 2023 at 10:50:58AM -0700, Nicolin Chen wrote:
>>
>>>> The point is for the user_data to always be available, the driver
>>>> needs to check it if it is passed.
>>>>
>>>> This should all be plumbed to allow drivers to also customize their
>>>> paging domains too.
>>>
>>> We don't have a use case of customizing the paging domains.
>>> And our selftest isn't covering this path. Nor the case is
>>> supported by the uAPI:
>>
>> But this is the design, it is why everything is setup like this - we
>> didn't create a new op to allocate nesting domains, we made a flexible
>> user allocator.
>>
>>> 458- * A kernel-managed HWPT will be created with the mappings from the given
>>> 459- * IOAS via the @pt_id. The @data_type for this allocation must be set to
>>> 460: * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
>>> 461- * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
>>> 462- *
>>
>> Yes, that is the reality today. If someone comes to use the more
>> complete interface they need to fix that comment..
> 
> Ack.
> 
>>> Also, if we do passing in the data, we'd need to...
>>   
>>> 280-static struct iommu_domain *
>>> 281-mock_domain_alloc_user(struct device *dev, u32 flags,
>>> 282-		       struct iommu_domain *parent,
>>> 283:		       const struct iommu_user_data *user_data)
>>> 284-{
>>> 285-	struct mock_iommu_domain *mock_parent;
>>> 286-	struct iommu_hwpt_selftest user_cfg;
>>> 287-	int rc;
>>> 288-
>>> 289:	if (!user_data) {	/* must be mock_domain */
>>>
>>> ...change this to if (!parent)...
>>
>> Yes, this logic is not ideal. The parent is the request for nesting,
>> not user_data. user_data is the generic creation parameters, which are
>> not supported outside nesting
>>   
>> Like this:
>>
>> --- a/drivers/iommu/iommufd/selftest.c
>> +++ b/drivers/iommu/iommufd/selftest.c
>> @@ -286,14 +286,12 @@ mock_domain_alloc_user(struct device *dev, u32 flags,
>>          int rc;
>>   
>>          /* must be mock_domain */
>> -       if (!user_data) {
>> +       if (!parent) {
>>                  struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
>>                  bool has_dirty_flag = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>                  bool no_dirty_ops = mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY;
>>   
>> -               if (parent)
>> -                       return ERR_PTR(-EINVAL);
>> -               if (has_dirty_flag && no_dirty_ops)
>> +               if (user_data || (has_dirty_flag && no_dirty_ops))
>>                          return ERR_PTR(-EOPNOTSUPP);
>>                  return __mock_domain_alloc_paging(IOMMU_DOMAIN_UNMANAGED,
>>                                                    has_dirty_flag);
> 
> Yea.. Then the vt-d driver needs a similar change too (@Yi) as I
> found it almost doing the same:
> https://lore.kernel.org/linux-iommu/20231024151412.50046-8-yi.l.liu@intel.com/
> 
yes. mock driver is kind of sample code, so the intel iommu driver is doing
almost the same thing. will follow up with the branch Jason shared.
  
Yi Liu Oct. 25, 2023, 10:19 a.m. UTC | #11
On 2023/10/25 01:30, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 02:18:10PM -0300, Jason Gunthorpe wrote:
>> On Tue, Oct 24, 2023 at 08:06:06AM -0700, Yi Liu wrote:
>>> @@ -195,6 +279,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>>>   	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
>>>   		struct iommufd_hwpt_paging *hwpt_paging;
>>>   
>>> +		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
>>> +			rc = -EINVAL;
>>> +			goto out_put_pt;
>>> +		}
>>>   		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
>>>   		mutex_lock(&ioas->mutex);
>>>   		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
>>
>> ?? What is this?
>>
>> Ah something went wrong earlier in "iommu: Pass in parent domain with
>> user_data to domain_alloc_user op"
> 
> Bah, I got confused because that had half the uapi, so in this pathc
>   
>> Once we added the user_data we should flow it through to the op
>> always.
> 
> Like this:

ack.

> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 92859b907bb93c..a3382811af8a81 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -586,8 +586,8 @@ iommufd_device_auto_get_domain(struct iommufd_device *idev,
>   		goto out_unlock;
>   	}
>   
> -	hwpt_paging = iommufd_hwpt_paging_alloc(idev->ictx, ioas, idev,
> -						 0, immediate_attach);
> +	hwpt_paging = iommufd_hwpt_paging_alloc(idev->ictx, ioas, idev, 0,
> +						immediate_attach, NULL);
>   	if (IS_ERR(hwpt_paging)) {
>   		destroy_hwpt = ERR_CAST(hwpt_paging);
>   		goto out_unlock;
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index cfd85df693d7b2..324a6d73f032ee 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -96,7 +96,8 @@ iommufd_hwpt_paging_enforce_cc(struct iommufd_hwpt_paging *hwpt_paging)
>   struct iommufd_hwpt_paging *
>   iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>   			  struct iommufd_device *idev, u32 flags,
> -			  bool immediate_attach)
> +			  bool immediate_attach,
> +			  const struct iommu_user_data *user_data)
>   {
>   	const u32 valid_flags = IOMMU_HWPT_ALLOC_NEST_PARENT |
>   				IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> @@ -107,7 +108,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>   
>   	lockdep_assert_held(&ioas->mutex);
>   
> -	if (flags && !ops->domain_alloc_user)
> +	if ((flags || user_data) && !ops->domain_alloc_user)
>   		return ERR_PTR(-EOPNOTSUPP);
>   	if (flags & ~valid_flags)
>   		return ERR_PTR(-EOPNOTSUPP);
> @@ -127,7 +128,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>   
>   	if (ops->domain_alloc_user) {
>   		hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
> -						      NULL, NULL);
> +						      NULL, user_data);
>   		if (IS_ERR(hwpt->domain)) {
>   			rc = PTR_ERR(hwpt->domain);
>   			hwpt->domain = NULL;
> @@ -210,8 +211,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>   	struct iommufd_hw_pagetable *hwpt;
>   	int rc;
>   
> -	if (flags || user_data->type == IOMMU_HWPT_DATA_NONE ||
> -	    !ops->domain_alloc_user)
> +	if (flags || !user_data->len || !ops->domain_alloc_user)
>   		return ERR_PTR(-EOPNOTSUPP);
>   	if (parent->auto_domain || !parent->nest_parent)
>   		return ERR_PTR(-EINVAL);
> @@ -249,6 +249,11 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>   int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>   {
>   	struct iommu_hwpt_alloc *cmd = ucmd->cmd;
> +	const struct iommu_user_data user_data = {
> +		.type = cmd->data_type,
> +		.uptr = u64_to_user_ptr(cmd->data_uptr),
> +		.len = cmd->data_len,
> +	};
>   	struct iommufd_hw_pagetable *hwpt;
>   	struct iommufd_ioas *ioas = NULL;
>   	struct iommufd_object *pt_obj;
> @@ -273,25 +278,17 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>   	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
>   		struct iommufd_hwpt_paging *hwpt_paging;
>   
> -		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
> -			rc = -EINVAL;
> -			goto out_put_pt;
> -		}

it can be covered by iommu driver when checking user data pointer. hence
remove above if.

>   		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
>   		mutex_lock(&ioas->mutex);
> -		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
> -							cmd->flags, false);
> +		hwpt_paging = iommufd_hwpt_paging_alloc(
> +			ucmd->ictx, ioas, idev, cmd->flags, false,
> +			user_data.len ? &user_data : NULL);
>   		if (IS_ERR(hwpt_paging)) {
>   			rc = PTR_ERR(hwpt_paging);
>   			goto out_unlock;
>   		}
>   		hwpt = &hwpt_paging->common;
>   	} else if (pt_obj->type == IOMMUFD_OBJ_HWPT_PAGING) {
> -		const struct iommu_user_data user_data = {
> -			.type = cmd->data_type,
> -			.uptr = u64_to_user_ptr(cmd->data_uptr),
> -			.len = cmd->data_len,
> -		};
>   		struct iommufd_hwpt_nested *hwpt_nested;
>   		struct iommufd_hwpt_paging *parent;
>   
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 6fdad56893af4d..24e5a36fc875e0 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -290,7 +290,8 @@ int iommufd_hwpt_get_dirty_bitmap(struct iommufd_ucmd *ucmd);
>   struct iommufd_hwpt_paging *
>   iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>   			  struct iommufd_device *idev, u32 flags,
> -			  bool immediate_attach);
> +			  bool immediate_attach,
> +			  const struct iommu_user_data *user_data);
>   int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>   				struct iommufd_device *idev);
>   struct iommufd_hw_pagetable *
  

Patch

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 902288ca910d..6dbd58a718e4 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -627,6 +627,7 @@  static int iommufd_device_change_pt(struct iommufd_device *idev, u32 *pt_id,
 		return PTR_ERR(pt_obj);
 
 	switch (pt_obj->type) {
+	case IOMMUFD_OBJ_HWPT_NESTED:
 	case IOMMUFD_OBJ_HWPT_PAGING: {
 		struct iommufd_hw_pagetable *hwpt =
 			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index fbe09e2473bc..d41428a0fdc2 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -44,6 +44,22 @@  void iommufd_hwpt_paging_abort(struct iommufd_object *obj)
 	iommufd_hwpt_paging_destroy(obj);
 }
 
+void iommufd_hwpt_nested_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_hwpt_nested *hwpt_nested =
+		container_of(obj, struct iommufd_hwpt_nested, common.obj);
+
+	if (hwpt_nested->common.domain)
+		iommu_domain_free(hwpt_nested->common.domain);
+
+	refcount_dec(&hwpt_nested->parent->common.obj.users);
+}
+
+void iommufd_hwpt_nested_abort(struct iommufd_object *obj)
+{
+	iommufd_hwpt_nested_destroy(obj);
+}
+
 static int
 iommufd_hwpt_paging_enforce_cc(struct iommufd_hwpt_paging *hwpt_paging)
 {
@@ -107,6 +123,7 @@  iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	/* Pairs with iommufd_hw_pagetable_destroy() */
 	refcount_inc(&ioas->obj.users);
 	hwpt_paging->ioas = ioas;
+	hwpt_paging->nest_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
 
 	if (ops->domain_alloc_user) {
 		hwpt->domain = ops->domain_alloc_user(idev->dev, flags,
@@ -170,6 +187,73 @@  iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	return ERR_PTR(rc);
 }
 
+/**
+ * iommufd_hwpt_nested_alloc() - Get a NESTED iommu_domain for a device
+ * @ictx: iommufd context
+ * @parent: Parent PAGING-type hwpt to associate the domain with
+ * @idev: Device to get an iommu_domain for
+ * @flags: Flags from userspace
+ * @user_data: user_data pointer. Must be valid
+ *
+ * Allocate a new iommu_domain (must be IOMMU_DOMAIN_NESTED) and return it as
+ * a NESTED hw_pagetable. The given parent PAGING-type hwpt must be capable of
+ * being a parent.
+ */
+static struct iommufd_hwpt_nested *
+iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
+			  struct iommufd_hwpt_paging *parent,
+			  struct iommufd_device *idev, u32 flags,
+			  const struct iommu_user_data *user_data)
+{
+	const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
+	struct iommufd_hwpt_nested *hwpt_nested;
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	if (flags != 0)
+		return ERR_PTR(-EOPNOTSUPP);
+	if (!user_data)
+		return ERR_PTR(-EINVAL);
+	if (user_data->type == IOMMU_HWPT_DATA_NONE)
+		return ERR_PTR(-EINVAL);
+	if (parent->auto_domain)
+		return ERR_PTR(-EINVAL);
+	if (!parent->nest_parent)
+		return ERR_PTR(-EINVAL);
+
+	if (!ops->domain_alloc_user)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	hwpt_nested = __iommufd_object_alloc(ictx, hwpt_nested,
+					     IOMMUFD_OBJ_HWPT_NESTED,
+					     common.obj);
+	if (IS_ERR(hwpt_nested))
+		return ERR_CAST(hwpt_nested);
+	hwpt = &hwpt_nested->common;
+
+	refcount_inc(&parent->common.obj.users);
+	hwpt_nested->parent = parent;
+
+	hwpt->domain = ops->domain_alloc_user(idev->dev, 0,
+					      parent->common.domain,
+					      user_data);
+	if (IS_ERR(hwpt->domain)) {
+		rc = PTR_ERR(hwpt->domain);
+		hwpt->domain = NULL;
+		goto out_abort;
+	}
+
+	if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
+		rc = -EINVAL;
+		goto out_abort;
+	}
+	return hwpt_nested;
+
+out_abort:
+	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
+
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_hwpt_alloc *cmd = ucmd->cmd;
@@ -195,6 +279,10 @@  int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	if (pt_obj->type == IOMMUFD_OBJ_IOAS) {
 		struct iommufd_hwpt_paging *hwpt_paging;
 
+		if (cmd->data_type != IOMMU_HWPT_DATA_NONE) {
+			rc = -EINVAL;
+			goto out_put_pt;
+		}
 		ioas = container_of(pt_obj, struct iommufd_ioas, obj);
 		mutex_lock(&ioas->mutex);
 		hwpt_paging = iommufd_hwpt_paging_alloc(ucmd->ictx, ioas, idev,
@@ -204,6 +292,23 @@  int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 			goto out_unlock;
 		}
 		hwpt = &hwpt_paging->common;
+	} else if (pt_obj->type == IOMMUFD_OBJ_HWPT_PAGING) {
+		const struct iommu_user_data user_data = {
+			.type = cmd->data_type,
+			.uptr = u64_to_user_ptr(cmd->data_uptr),
+			.len = cmd->data_len,
+		};
+		struct iommufd_hwpt_nested *hwpt_nested;
+		struct iommufd_hwpt_paging *parent;
+
+		parent = container_of(pt_obj, typeof(*parent), common.obj);
+		hwpt_nested = iommufd_hwpt_nested_alloc(ucmd->ictx, parent, idev,
+							cmd->flags, &user_data);
+		if (IS_ERR(hwpt_nested)) {
+			rc = PTR_ERR(hwpt_nested);
+			goto out_unlock;
+		}
+		hwpt = &hwpt_nested->common;
 	} else {
 		rc = -EINVAL;
 		goto out_put_pt;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6493fcdad1b7..6fdad56893af 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -124,6 +124,7 @@  enum iommufd_object_type {
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HWPT_PAGING,
+	IOMMUFD_OBJ_HWPT_NESTED,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_ACCESS,
 #ifdef CONFIG_IOMMUFD_TEST
@@ -255,10 +256,16 @@  struct iommufd_hwpt_paging {
 	bool auto_domain : 1;
 	bool enforce_cache_coherency : 1;
 	bool msi_cookie : 1;
+	bool nest_parent : 1;
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 };
 
+struct iommufd_hwpt_nested {
+	struct iommufd_hw_pagetable common;
+	struct iommufd_hwpt_paging *parent;
+};
+
 static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
 {
 	return hwpt->obj.type == IOMMUFD_OBJ_HWPT_PAGING;
@@ -290,18 +297,28 @@  struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_detach(struct iommufd_device *idev);
 void iommufd_hwpt_paging_destroy(struct iommufd_object *obj);
 void iommufd_hwpt_paging_abort(struct iommufd_object *obj);
+void iommufd_hwpt_nested_destroy(struct iommufd_object *obj);
+void iommufd_hwpt_nested_abort(struct iommufd_object *obj);
 int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd);
 
 static inline void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 					    struct iommufd_hw_pagetable *hwpt)
 {
-	struct iommufd_hwpt_paging *hwpt_paging = to_hwpt_paging(hwpt);
+	if (WARN_ON(hwpt->obj.type != IOMMUFD_OBJ_HWPT_PAGING &&
+		    hwpt->obj.type != IOMMUFD_OBJ_HWPT_NESTED))
+		return;
 
-	lockdep_assert_not_held(&hwpt_paging->ioas->mutex);
-	if (hwpt_paging->auto_domain)
-		iommufd_object_deref_user(ictx, &hwpt->obj);
-	else
-		refcount_dec(&hwpt->obj.users);
+	if (hwpt->obj.type == IOMMUFD_OBJ_HWPT_PAGING) {
+		struct iommufd_hwpt_paging *hwpt_paging = to_hwpt_paging(hwpt);
+
+		lockdep_assert_not_held(&hwpt_paging->ioas->mutex);
+
+		if (hwpt_paging->auto_domain) {
+			iommufd_object_deref_user(ictx, &hwpt->obj);
+			return;
+		}
+	}
+	refcount_dec(&hwpt->obj.users);
 }
 
 struct iommufd_group {
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ab6675a7f6b4..45b9d40773b1 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -492,6 +492,10 @@  static const struct iommufd_object_ops iommufd_object_ops[] = {
 		.destroy = iommufd_hwpt_paging_destroy,
 		.abort = iommufd_hwpt_paging_abort,
 	},
+	[IOMMUFD_OBJ_HWPT_NESTED] = {
+		.destroy = iommufd_hwpt_nested_destroy,
+		.abort = iommufd_hwpt_nested_abort,
+	},
 #ifdef CONFIG_IOMMUFD_TEST
 	[IOMMUFD_OBJ_SELFTEST] = {
 		.destroy = iommufd_selftest_destroy,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index fccc6315a520..d816deac906f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -374,15 +374,31 @@  enum iommu_hwpt_data_type {
  * @size: sizeof(struct iommu_hwpt_alloc)
  * @flags: Combination of enum iommufd_hwpt_alloc_flags
  * @dev_id: The device to allocate this HWPT for
- * @pt_id: The IOAS to connect this HWPT to
+ * @pt_id: The IOAS or HWPT to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
  * @__reserved: Must be 0
+ * @data_type: One of enum iommu_hwpt_data_type
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
  *
  * Explicitly allocate a hardware page table object. This is the same object
  * type that is returned by iommufd_device_attach() and represents the
  * underlying iommu driver's iommu_domain kernel object.
  *
- * A HWPT will be created with the IOVA mappings from the given IOAS.
+ * A kernel-managed HWPT will be created with the mappings from the given
+ * IOAS via the @pt_id. The @data_type for this allocation must be set to
+ * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
+ * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
+ *
+ * A user-managed nested HWPT will be created from a given parent HWPT via
+ * @pt_id, in which the parent HWPT must be allocated previously via the
+ * same ioctl from a given IOAS (@pt_id). In this case, the @data_type
+ * must be set to a pre-defined type corresponding to an I/O page table
+ * type supported by the underlying IOMMU hardware.
+ *
+ * If the @data_type is set to IOMMU_HWPT_DATA_NONE, @data_len and
+ * @data_uptr should be zero. Otherwise, both @data_len and @data_uptr
+ * must be given.
  */
 struct iommu_hwpt_alloc {
 	__u32 size;
@@ -391,6 +407,9 @@  struct iommu_hwpt_alloc {
 	__u32 pt_id;
 	__u32 out_hwpt_id;
 	__u32 __reserved;
+	__u32 data_type;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)