[RFC,6/6] iommu/amd: Introduce nested translation support

Message ID 20231212160139.174229-7-suravee.suthikulpanit@amd.com
State New
Series iommu/amd: Introduce hardware info reporting and nested translation support

Commit Message

Suravee Suthikulpanit Dec. 12, 2023, 4:01 p.m. UTC
  To support nested translation on AMD IOMMU, the driver needs to
program DTE[GCR3 Table Root Pointer] with the address provided by
the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
of the struct iommu_ops.domain_alloc_user() with the flag
IOMMU_HWPT_ALLOC_NEST_PARENT.

Note that the current implementation only supports GCR3TRPMode for
nested translation, which uses a GPA to program the GCR3 Table Root Pointer.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/Makefile          |   2 +-
 drivers/iommu/amd/amd_iommu.h       |   8 +++
 drivers/iommu/amd/amd_iommu_types.h |   3 +
 drivers/iommu/amd/iommu.c           |  63 ++++++++++++++--
 drivers/iommu/amd/nested.c          | 107 ++++++++++++++++++++++++++++
 5 files changed, 175 insertions(+), 8 deletions(-)
 create mode 100644 drivers/iommu/amd/nested.c
  

Comments

Jason Gunthorpe Dec. 13, 2023, 1:52 p.m. UTC | #1
On Tue, Dec 12, 2023 at 10:01:39AM -0600, Suravee Suthikulpanit wrote:

> -	if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
> +		ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
> +		if (ret)
> +			return ERR_PTR(ret);
> +
> +		return amd_iommu_nested_domain_alloc(dev, &hwpt);
> +	}
> +
> +	/* Check supported flags */
> +	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
> +		       IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	if (!check_nested_support(flags))
>  		return ERR_PTR(-EOPNOTSUPP);
>  
> -	return do_iommu_domain_alloc(type, dev, flags);
> +	dom = iommu_domain_alloc(dev->bus);

Please don't call iommu_domain_alloc, call your internal function and
force it to allocate the v1 domain..

> +static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
> +{
> +	int ret;
> +	u16 hdev_id;
> +	struct pci_dev *pdev;
> +	struct amd_iommu *iommu;
> +
> +	iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
> +	hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
> +
> +	pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
> +		 __func__, hwpt->gid, hdev_id,
> +		 (unsigned long long) hwpt->gcr3);
> +
> +	pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
> +					   hdev_id & 0xff);

Huh? "hdev_id"? This is not OK..

The device you are allowed to look at is the "struct device *dev" passed
to alloc. You cannot pass in a struct device and then override it with
another value.

> +	if (!pdev)
> +		return -EINVAL;
> +
> +	/* Note: Currently only support GCR3TRPMode with nested translation */
> +	if (!check_feature2(FEATURE_GCR3TRPMODE))
> +		return -EOPNOTSUPP;
> +
> +	ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
> +					hwpt->guest_paging_mode);

Waah?

This is touching the dev table? That is not right, allocation is only
*ALLOCATION*. The dev table can't be changed until you do attachment.

Please look at the smmuv3 patches and try to be structurally
similar. AMD and SMMUv3 are *very similar* in how their HW works
excluding the viommu stuff.

You also can't assume your parent is currently attached to anything.

The construction of the DTE has to be from-scratch based on the parent
domain and the provided values in the "hwpt". Again see how smmuv3
does this where there is one function that builds the entire DTE
(called STE)

I'm skeptical you can do this properly without also restructuring the
DTE logic like I've mentioned before, there is a reason I did that for
SMMUv3. :)

> +struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
> +						   struct iommu_hwpt_amd_v2 *hwpt)
> +{
> +	int ret;
> +	struct iommu_domain *dom;
> +	struct protection_domain *pdom;
> +
> +	dom = iommu_domain_alloc(dev->bus);
> +	if (!dom)
> +		return ERR_PTR(-ENOMEM);

Also no, do not allocate a normal domain and then 'wreck'
it into a nesting domain. Refactor the allocation code to be in
smaller chunks so you can alloc and init the memory directly here.

Jason
  
Tian, Kevin Dec. 15, 2023, 7:45 a.m. UTC | #2
> From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> Sent: Wednesday, December 13, 2023 12:02 AM
> 
> To support nested translation on AMD IOMMU, the driver needs to
> program DTE[GCR3 Table Root Pointer] with the address provided by
> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
> of the struct iommu_ops.domain_alloc_user() with the flag
> IOMMU_HWPT_ALLOC_NEST_PARENT.
> 
> Note that current implementation only support GCR3TRPMode for
> nested translation, which uses GPA to program GCR3 Table Root Pointer.
> 

Does this mean there is a plan to support another mode in the future,
or does nested translation actually require GCR3TRPMode as a
functional requirement? IMHO the use of a GPA is conceptually assumed
in a nested configuration...
  
Suravee Suthikulpanit Jan. 5, 2024, 1:38 p.m. UTC | #3
Hi Jason

On 12/13/2023 8:52 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 10:01:39AM -0600, Suravee Suthikulpanit wrote:
> 
>> -	if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
>> +		ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
>> +		if (ret)
>> +			return ERR_PTR(ret);
>> +
>> +		return amd_iommu_nested_domain_alloc(dev, &hwpt);
>> +	}
>> +
>> +	/* Check supported flags */
>> +	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
>> +		       IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
>> +		return ERR_PTR(-EOPNOTSUPP);
>> +
>> +	if (!check_nested_support(flags))
>>   		return ERR_PTR(-EOPNOTSUPP);
>>   
>> -	return do_iommu_domain_alloc(type, dev, flags);
>> +	dom = iommu_domain_alloc(dev->bus);
> 
> Please don't call iommu_domain_alloc, call your internal function and
> force it to allocate the v1 domain..

Okay.

>> +static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
>> +{
>> +	int ret;
>> +	u16 hdev_id;
>> +	struct pci_dev *pdev;
>> +	struct amd_iommu *iommu;
>> +
>> +	iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
>> +	hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
>> +
>> +	pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
>> +		 __func__, hwpt->gid, hdev_id,
>> +		 (unsigned long long) hwpt->gcr3);
>> +
>> +	pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
>> +					   hdev_id & 0xff);
> 
> Huh? "hdev_id"? This is not OK..
> 
> The device you are allowed to look at is the "struct device *dev" passed
> to alloc. You cannot pass in a struct device and then override it with
> another value.

Good point. I'll fix this to use the dev.

>> +	if (!pdev)
>> +		return -EINVAL;
>> +
>> +	/* Note: Currently only support GCR3TRPMode with nested translation */
>> +	if (!check_feature2(FEATURE_GCR3TRPMODE))
>> +		return -EOPNOTSUPP;
>> +
>> +	ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
>> +					hwpt->guest_paging_mode);
> 
> Waah?
> 
> This is touching the dev table? That is not right, allocation is only
> *ALLOCATION*. The dev table can't be changed until you do attachment.

My understanding is QEMU should call:

1. iommufd_backend_get_ioas()
    -> ioctl(IOMMU_IOAS_ALLOC)

2. For parent domain
iommufd_backend_alloc_hwpt(IOMMU_HWPT_ALLOC_NEST_PARENT)
   -> ioctl( IOMMU_HWPT_ALLOC)
   --- in Linux ---
   ....
   -> iommufd_hwpt_paging_alloc()
     -> struct iommu_ops.domain_alloc_user(IOMMU_HWPT_ALLOC_NEST_PARENT)

3. For parent domain
iommufd_device_attach_hwpt()
   -> vfio_device_attach_container()
     -> ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT)
     --- in Linux ---
     vfio_iommufd_physical_attach_ioas()
     -> iommufd_device_attach()
       -> iommufd_device_do_attach()
         -> iommufd_hw_pagetable_attach()

4. Same as (2) but for child domain w/ stage 1 table.

5. Same as (3) but for child domain w/ stage 1 table.

You want the GCR3 table root pointer in the DTE to be updated at step 5
only, right? If so, this should be okay.

> Please look at the smmuv3 patches and try to be structurally
> similar. AMD and SMMUv3 are *very similar* in how their HW works
> excluding the viommu stuff.

Could you please point me to the series? I found one but it was really 
old. I might have missed the latest stuff.

> You also can't assume your parent is currently attached to anything.
> 
> The construction of the DTE has to be from-scratch based on the parent
> domain and the provided values in the "hwpt". Again see how smmuv3
> does this where there is one function that builds the entire DTE
> (called STE)

Ok. I'll program the DMA-remapping-related fields of the DTE for the
v1 and v2 tables using only the information in the parent and child
domains.

> I'm skeptical you can do this properly without also restructuring the
> DTE logic like I've mentioned before, there is a reason I did that for
> SMMUv3. :)

A device can be attached to a domain, which could be either a one-level
or a nested domain. In the case of nesting, the parent domain contains
information for the stage-2 (v1) table, while the child domain contains
information for the stage-1 (v2) table. For each domain, we need to keep
track of the parent domain.

When calling set_dte_entry(), we need to check whether the device is
attached to a domain which has a parent. If so, we need to configure the
DTE using information from both domains accordingly.

I'll update set_dte_entry() to reflect this logic for programming the DTE.

>> +struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
>> +						   struct iommu_hwpt_amd_v2 *hwpt)
>> +{
>> +	int ret;
>> +	struct iommu_domain *dom;
>> +	struct protection_domain *pdom;
>> +
>> +	dom = iommu_domain_alloc(dev->bus);
>> +	if (!dom)
>> +		return ERR_PTR(-ENOMEM);
> 
>
> Also no, do not allocate a normal domain and then 'wreck'
> it into a nesting domain. Refactor the allocation code to be in
> smaller chucks so you can alloc and init the memory directly here.

Good point. I'll take care of this in drivers/iommu/amd/iommu.c:
do_iommu_domain_alloc().

Thanks,
Suravee
  
Suravee Suthikulpanit Jan. 5, 2024, 1:39 p.m. UTC | #4
Hi Kevin,

On 12/15/2023 2:45 PM, Tian, Kevin wrote:
>> From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>> Sent: Wednesday, December 13, 2023 12:02 AM
>>
>> To support nested translation on AMD IOMMU, the driver needs to
>> program DTE[GCR3 Table Root Pointer] with the address provided by
>> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
>> of the struct iommu_ops.domain_alloc_user() with the flag
>> IOMMU_HWPT_ALLOC_NEST_PARENT.
>>
>> Note that current implementation only support GCR3TRPMode for
>> nested translation, which uses GPA to program GCR3 Table Root Pointer.
>>
> 
> means there is a plan to support another mode in the future or
> actually the nested translation requires GCR3TRPMode as a
> functional requirement? imho the point of GPA is assumed
> in the nested configuration in concept...

On (older) systems which do not support GCR3TRPMode, the IOMMU driver
needs to program the device's DTE[GCR3 Table Root Pointer] field with an SPA.

When QEMU presents an AMD vIOMMU device to a guest, the guest programs
the guest DTE[GCR3 Table Root Pointer] with a GPA. Then we need to:
1. Trap the DTE write
2. Translate the GPA->SPA
3. Program the DTE with the SPA.

With GCR3TRPMode, we can skip step 2 above and directly program the
GPA in step 3.

Suravee
  
Jason Gunthorpe Jan. 5, 2024, 2:31 p.m. UTC | #5
On Fri, Jan 05, 2024 at 08:38:47PM +0700, Suthikulpanit, Suravee wrote:
> > > +	if (!pdev)
> > > +		return -EINVAL;
> > > +
> > > +	/* Note: Currently only support GCR3TRPMode with nested translation */
> > > +	if (!check_feature2(FEATURE_GCR3TRPMODE))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
> > > +					hwpt->guest_paging_mode);
> > 
> > Waah?
> > 
> > This is touching the dev table? That is not right, allocation is only
> > *ALLOCATION*. The dev table can't be changed until you do attachment.
> 
> My understanding is QEMU should call:
> 
> 1. iommufd_backend_get_ioas()
>    -> ioctl(IOMMU_IOAS_ALLOC)
> 
> 2. For parent domain
> iommufd_backend_alloc_hwpt(IOMMU_HWPT_ALLOC_NEST_PARENT)
>   -> ioctl( IOMMU_HWPT_ALLOC)
>   --- in Linux ---
>   ....
>   -> iommufd_hwpt_paging_alloc()
>     -> struct iommu_ops.domain_alloc_user(IOMMU_HWPT_ALLOC_NEST_PARENT)
> 
> 3. For parent domain
> iommufd_device_attach_hwpt()
>   -> vfio_device_attach_container()
>     -> ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT)
>     --- in Linux ---
>     vfio_iommufd_physical_attach_ioas()
>     -> iommufd_device_attach()
>       -> iommufd_device_do_attach()
>         -> iommufd_hw_pagetable_attach()
> 
> 4. Same as (2) but for child domain w/ stage 1 table.
> 
> 5. Same as (3) but for child domain w/ stage 1 table.

Yes, but understand it is not API that the driver can depend on that
order. #3 can be skipped and there can be multiple parents and
everything must still work.

> You want the gcr3 table root point in the DTE to be update at step 5 only,
> right? If so, this should be okay.

#3 sets the DTE to point to just the parent domain as a V1 page table

#4/5 sets the DTE to point to both the V1 page table and the GCR3 table

The DTE is calculated based only on the *currently* attached
iommu_domain's for the struct device.

> > Please look at the smmuv3 patches and try to be structurally
> > similar. AMD and SMMUv3 are *very similar* in how their HW works
> > excluding the viommu stuff.
> 
> Could you please point me to the series? I found one but it was really old.
> I might have missed the latest stuff.

https://lore.kernel.org/linux-iommu/0-v1-e289ca9121be+2be-smmuv3_newapi_p1_jgg@nvidia.com/
https://lore.kernel.org/linux-iommu/0-v2-16665a652079+5947-smmuv3_newapi_p2_jgg@nvidia.com/

(and the github link has the other parts)

> > You also can't assume your parent is currently attached to anything.
> > 
> > The construction of the DTE has to be from-scratch based on the parent
> > domain and the provided values in the "hwpt". Again see how smmuv3
> > does this where there is one function that builds the entire DTE
> > (called STE)
> 
> Ok. I'll program fields of the DTE, which are related to DMA-remapping with
> v1 and v2 table using the information in the parent and child domains only.

The smmu code looks like:

static void arm_smmu_make_nested_domain_ste(
	struct arm_smmu_ste *target, struct arm_smmu_master *master,
	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
{
        [..]
        // Incorporate the "v1 fields" into the "DTE"
	arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
				    ats_enabled);

        // Incorporate the "GCR3 fields" into the "DTE"
	target->data[0] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_0_CFG,
						  STRTAB_STE_0_CFG_NESTED)) |
			   nested_domain->ste[0];

Where "arm_smmu_ste *" is stack memory that holds the "DTE" being
constructed. Once the make* family of function figure out the exact
STE that the struct device needs, another function writes the STE to
the HW visible location(s).

The goal is to be able to calculate the required DTE for the #3 and
#4/5 cases you list above directly without making assumptions about
the current state of the DTE or anything else.

> A device can be attached to a domain, which could be either one-level or
> nested domain. In case of nesting, the parent domain contains information
> for the stage2 (v1) table, while the child domain contains information for
> the stage1 (v2) table. For each domain, we need to keep track of the parent
> domain.

The nesting domain, and only the nesting domain, stores a pointer to
its parent domain - they are a unit together.

Jason
  
Jason Gunthorpe Jan. 5, 2024, 2:37 p.m. UTC | #6
On Fri, Jan 05, 2024 at 08:39:52PM +0700, Suthikulpanit, Suravee wrote:
> Hi Kevin,
> 
> On 12/15/2023 2:45 PM, Tian, Kevin wrote:
> > > From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> > > Sent: Wednesday, December 13, 2023 12:02 AM
> > > 
> > > To support nested translation on AMD IOMMU, the driver needs to
> > > program DTE[GCR3 Table Root Pointer] with the address provided by
> > > the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
> > > of the struct iommu_ops.domain_alloc_user() with the flag
> > > IOMMU_HWPT_ALLOC_NEST_PARENT.
> > > 
> > > Note that current implementation only support GCR3TRPMode for
> > > nested translation, which uses GPA to program GCR3 Table Root Pointer.
> > > 
> > 
> > means there is a plan to support another mode in the future or
> > actually the nested translation requires GCR3TRPMode as a
> > functional requirement? imho the point of GPA is assumed
> > in the nested configuration in concept...
> 
> On (older) system, which does not support GCR3TRPMode, the IOMMU driver
> needs to program the device's DTE[GCR3 Table Root Pointer] field w/ SPA.

Meaning that on older systems the GCR3 Table Root Pointer is not
translated by the parent v1 page table?

> When QEMU presents an AMD vIOMMU device to a guest, the guest programs the
> guest DTE[GCR3 Table Root Pointer] with GPA. Then we need to :
> 1. Traps the DTE write
> 2. Translate the GPA->SPA
> 3. Program DTE with SPA.
> 
> With the GCR3TRPMode, we can skip step 2 above and directly program step 3
> with GPA.

Do you want to support this? It will be hard to do because it is not
just those three steps (which are easy) but that you have to somehow
maintain coherence with any changes to the parent page table, so you
have to hook the iommu_domain unmap as well...

Jason
  
Suravee Suthikulpanit Jan. 8, 2024, 6:49 a.m. UTC | #7
On 1/5/2024 9:37 PM, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 08:39:52PM +0700, Suthikulpanit, Suravee wrote:
>> Hi Kevin,
>>
>> On 12/15/2023 2:45 PM, Tian, Kevin wrote:
>>>> From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>>>> Sent: Wednesday, December 13, 2023 12:02 AM
>>>>
>>>> To support nested translation on AMD IOMMU, the driver needs to
>>>> program DTE[GCR3 Table Root Pointer] with the address provided by
>>>> the guest via struct iommu_hwpt_amd_v2, which is passed as a parameter
>>>> of the struct iommu_ops.domain_alloc_user() with the flag
>>>> IOMMU_HWPT_ALLOC_NEST_PARENT.
>>>>
>>>> Note that current implementation only support GCR3TRPMode for
>>>> nested translation, which uses GPA to program GCR3 Table Root Pointer.
>>>>
>>>
>>> means there is a plan to support another mode in the future or
>>> actually the nested translation requires GCR3TRPMode as a
>>> functional requirement? imho the point of GPA is assumed
>>> in the nested configuration in concept...
>>
>> On (older) system, which does not support GCR3TRPMode, the IOMMU driver
>> needs to program the device's DTE[GCR3 Table Root Pointer] field w/ SPA.
> 
> Meaning that on older systems the GCR3 Table Root Pointer is not
> translated by the parent v1 page table?

Correct.

>> When QEMU presents an AMD vIOMMU device to a guest, the guest programs the
>> guest DTE[GCR3 Table Root Pointer] with GPA. Then we need to :
>> 1. Traps the DTE write
>> 2. Translate the GPA->SPA
>> 3. Program DTE with SPA.
>>
>> With the GCR3TRPMode, we can skip step 2 above and directly program step 3
>> with GPA.
> 
> Do you want to support this? It will be hard to do because it is not
> just those three steps (which are easy) but that you have to somehow
> maintain coherence with any changes to the parent page table, so you
> have to hook the iommu_domain unmap as well...

I'm debating this. Let me get back on this part.

Thanks,
Suravee
  

Patch

diff --git a/drivers/iommu/amd/Makefile b/drivers/iommu/amd/Makefile
index f454fbb1569e..447cb6bb48eb 100644
--- a/drivers/iommu/amd/Makefile
+++ b/drivers/iommu/amd/Makefile
@@ -1,3 +1,3 @@ 
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o
+obj-$(CONFIG_AMD_IOMMU) += iommu.o init.o quirks.o io_pgtable.o io_pgtable_v2.o nested.o
 obj-$(CONFIG_AMD_IOMMU_DEBUGFS) += debugfs.o
diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 55479a6efaae..6ea146a964df 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -7,6 +7,7 @@ 
 #ifndef AMD_IOMMU_H
 #define AMD_IOMMU_H
 
+#include <uapi/linux/iommufd.h>
 #include <linux/iommu.h>
 
 #include "amd_iommu_types.h"
@@ -75,6 +76,8 @@  void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
 				   ioasid_t pasid);
 
 void amd_iommu_build_efr(u64 *efr, u64 *efr2);
+int amd_iommu_attach_device(struct iommu_domain *dom, struct device *dev);
+void amd_iommu_domain_free(struct iommu_domain *dom);
 
 #ifdef CONFIG_IRQ_REMAP
 int amd_iommu_create_irq_domain(struct amd_iommu *iommu);
@@ -190,4 +193,9 @@  int amd_iommu_vminfo_alloc(struct amd_iommu *iommu, struct amd_iommu_vminfo *vmi
 void amd_iommu_vminfo_free(struct amd_iommu *iommu, struct amd_iommu_vminfo *vminfo);
 struct amd_iommu_vminfo *amd_iommu_get_vminfo(int gid);
 
+/* NESTED */
+struct protection_domain *to_pdomain(struct iommu_domain *dom);
+struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
+						   struct iommu_hwpt_amd_v2 *hwpt);
+
 #endif
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 1b150e0cb689..c2055b476a97 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -114,6 +114,8 @@ 
 #define FEATURE_PASMAX_MASK	(0x1FULL << FEATURE_PASMAX_SHIFT)
 
 /* Extended Feature 2 Bits */
+#define FEATURE_GCR3TRPMODE	BIT_ULL(3)
+
 #define FEATURE_SNPAVICSUP_SHIFT	5
 #define FEATURE_SNPAVICSUP_MASK		(0x07ULL << FEATURE_SNPAVICSUP_SHIFT)
 #define FEATURE_SNPAVICSUP_GAM(x) \
@@ -1058,6 +1060,7 @@  struct amd_irte_ops {
 struct amd_iommu_vminfo {
 	u16 gid;
 	struct hlist_node hnode;
+	u64 *devid_table;
 };
 
 #ifdef CONFIG_IRQ_REMAP
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 8bf12674dc84..2a7e29e8c112 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -260,7 +260,7 @@  static struct amd_iommu *rlookup_amd_iommu(struct device *dev)
 	return __rlookup_amd_iommu(seg, PCI_SBDF_TO_DEVID(devid));
 }
 
-static struct protection_domain *to_pdomain(struct iommu_domain *dom)
+struct protection_domain *to_pdomain(struct iommu_domain *dom)
 {
 	return container_of(dom, struct protection_domain, domain);
 }
@@ -2526,21 +2526,70 @@  static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
 	return domain;
 }
 
+static int udata_to_iommu_hwpt_amd_v2(const struct iommu_user_data *user_data,
+				       struct iommu_hwpt_amd_v2 *hwpt)
+{
+	if (!user_data)
+		return -EINVAL;
+
+	if (user_data->type != IOMMU_HWPT_DATA_AMD_V2)
+		return -EOPNOTSUPP;
+
+	return iommu_copy_struct_from_user(hwpt, user_data,
+					   IOMMU_HWPT_DATA_AMD_V2,
+					   guest_paging_mode);
+}
+
+static bool check_nested_support(u32 flags)
+{
+	if (!(flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+		return true;
+
+	if (!check_feature(FEATURE_GT) ||
+	    !check_feature(FEATURE_GIOSUP) ||
+	    !check_feature2(FEATURE_GCR3TRPMODE))
+		return false;
+
+	return true;
+}
+
 static struct iommu_domain *
 amd_iommu_domain_alloc_user(struct device *dev, u32 flags,
 			    struct iommu_domain *parent,
 			    const struct iommu_user_data *user_data)
-
 {
-	unsigned int type = IOMMU_DOMAIN_UNMANAGED;
+	struct iommu_domain *dom;
+
+	if (parent) {
+		int ret;
+		struct iommu_hwpt_amd_v2 hwpt;
+
+		if (parent->ops != amd_iommu_ops.default_domain_ops)
+			return ERR_PTR(-EINVAL);
 
-	if ((flags & ~IOMMU_HWPT_ALLOC_DIRTY_TRACKING) || parent || user_data)
+		ret = udata_to_iommu_hwpt_amd_v2(user_data, &hwpt);
+		if (ret)
+			return ERR_PTR(ret);
+
+		return amd_iommu_nested_domain_alloc(dev, &hwpt);
+	}
+
+	/* Check supported flags */
+	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT |
+		       IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (!check_nested_support(flags))
 		return ERR_PTR(-EOPNOTSUPP);
 
-	return do_iommu_domain_alloc(type, dev, flags);
+	dom = iommu_domain_alloc(dev->bus);
+	if (!dom)
+		return ERR_PTR(-ENOMEM);
+
+	return dom;
 }
 
-static void amd_iommu_domain_free(struct iommu_domain *dom)
+void amd_iommu_domain_free(struct iommu_domain *dom)
 {
 	struct protection_domain *domain;
 	unsigned long flags;
@@ -2559,7 +2608,7 @@  static void amd_iommu_domain_free(struct iommu_domain *dom)
 	protection_domain_free(domain);
 }
 
-static int amd_iommu_attach_device(struct iommu_domain *dom,
+int amd_iommu_attach_device(struct iommu_domain *dom,
 				   struct device *dev)
 {
 	struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
diff --git a/drivers/iommu/amd/nested.c b/drivers/iommu/amd/nested.c
new file mode 100644
index 000000000000..332f7efcdc92
--- /dev/null
+++ b/drivers/iommu/amd/nested.c
@@ -0,0 +1,107 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Advanced Micro Devices, Inc.
+ * Author: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
+ */
+
+#define pr_fmt(fmt)     "AMD-Vi: " fmt
+#define dev_fmt(fmt)    pr_fmt(fmt)
+
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "amd_iommu.h"
+
+static struct amd_iommu *get_amd_iommu_from_devid(u16 devid)
+{
+	struct amd_iommu *iommu;
+
+	for_each_iommu(iommu)
+		if (iommu->devid == devid)
+			return iommu;
+	return NULL;
+}
+
+/*
+ * Note:
+ * Host-DevID is stored in the per-VM DevID mapping table,
+ * which is indexed by the Guest-DevID.
+ */
+static u16 get_hdev_id(struct amd_iommu *iommu, u16 guestId, u16 gdev_id)
+{
+	struct amd_iommu_vminfo *vminfo;
+	void *addr;
+	u64 offset;
+
+	vminfo = amd_iommu_get_vminfo(guestId);
+	if (!vminfo)
+		return -1;
+
+	addr = vminfo->devid_table;
+	offset = gdev_id << 4;
+	return (*((u64 *)(addr + offset)) >> 24) & 0xFFFF;
+}
+
+static int nested_gcr3_update(struct iommu_hwpt_amd_v2 *hwpt, struct iommu_domain *udom)
+{
+	int ret;
+	u16 hdev_id;
+	struct pci_dev *pdev;
+	struct amd_iommu *iommu;
+
+	iommu = get_amd_iommu_from_devid(hwpt->iommu_id);
+	hdev_id = get_hdev_id(iommu, hwpt->gid, hwpt->gdev_id);
+
+	pr_debug("%s: gid=%u, hdev_id=%#x, gcr3=%#llx\n",
+		 __func__, hwpt->gid, hdev_id,
+		 (unsigned long long) hwpt->gcr3);
+
+	pdev = pci_get_domain_bus_and_slot(0, PCI_BUS_NUM(hdev_id),
+					   hdev_id & 0xff);
+	if (!pdev)
+		return -EINVAL;
+
+	/* Note: Currently only support GCR3TRPMode with nested translation */
+	if (!check_feature2(FEATURE_GCR3TRPMODE))
+		return -EOPNOTSUPP;
+
+	ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
+					hwpt->guest_paging_mode);
+	if (ret) {
+		pr_err("%s: Fail to enable gcr3 (devid=%#x)\n", __func__,
+		       pci_dev_id(pdev));
+	}
+
+	return ret;
+}
+
+static const struct iommu_domain_ops nested_domain_ops = {
+	.attach_dev		= amd_iommu_attach_device,
+	.free			= amd_iommu_domain_free,
+};
+
+struct iommu_domain *amd_iommu_nested_domain_alloc(struct device *dev,
+						   struct iommu_hwpt_amd_v2 *hwpt)
+{
+	int ret;
+	struct iommu_domain *dom;
+	struct protection_domain *pdom;
+
+	dom = iommu_domain_alloc(dev->bus);
+	if (!dom)
+		return ERR_PTR(-ENOMEM);
+
+	pdom = to_pdomain(dom);
+	dom->type = IOMMU_DOMAIN_NESTED;
+	dom->ops = &nested_domain_ops;
+
+	ret = nested_gcr3_update(hwpt, dom);
+	if (ret)
+		goto err_out;
+
+	return dom;
+
+err_out:
+	iommu_domain_free(dom);
+	return ERR_PTR(-EINVAL);
+}