[v1,02/14] iommufd: Add nesting related data structures for ARM SMMUv3

Message ID 364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com
State New
Series Add Nested Translation Support for SMMUv3

Commit Message

Nicolin Chen March 9, 2023, 10:53 a.m. UTC
  Add the following data structures for corresponding ioctls:
               iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
    iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE

Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
to the header and corresponding type/size arrays.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/hw_pagetable.c |  4 +++
 drivers/iommu/iommufd/main.c         |  1 +
 include/uapi/linux/iommufd.h         | 50 ++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)
  

Comments

Jean-Philippe Brucker March 9, 2023, 1:42 p.m. UTC | #1
Hi Nicolin,

On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
> Add the following data structures for corresponding ioctls:
>                iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
>     iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE
> 
> Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
> to the header and corresponding type/size arrays.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

> +/**
> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
> + *
> + * @flags: page table entry attributes
> + * @s2vmid: Virtual machine identifier
> + * @s1ctxptr: Stage-1 context descriptor pointer
> + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
> + * @s1fmt: Stage-1 Format
> + * @s1dss: Default substream
> + */
> +struct iommu_hwpt_arm_smmuv3 {
> +#define IOMMU_SMMUV3_FLAG_S2	(1 << 0) /* if unset, stage-1 */

I don't understand the purpose of this flag, since the structure only
provides stage-1 configuration fields

> +#define IOMMU_SMMUV3_FLAG_VMID	(1 << 1) /* vmid override */

Doesn't this break isolation?  The VMID space is global for the SMMU, so
the guest could access cached mappings of another device

> +	__u64 flags;
> +	__u32 s2vmid;
> +	__u32 __reserved;
> +	__u64 s1ctxptr;
> +	__u64 s1cdmax;
> +	__u64 s1fmt;
> +	__u64 s1dss;
> +};
> +


> +/**
> + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
> + * @flags: boolean attributes of cache invalidation command
> + * @opcode: opcode of cache invalidation command
> + * @ssid: SubStream ID
> + * @granule_size: page/block size of the mapping in bytes
> + * @range: IOVA range to invalidate
> + */
> +struct iommu_hwpt_invalidate_arm_smmuv3 {
> +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF	(1 << 0)
> +	__u64 flags;
> +	__u8 opcode;
> +	__u8 padding[3];
> +	__u32 asid;
> +	__u32 ssid;
> +	__u32 granule_size;
> +	struct iommu_iova_range range;
> +};

Although we can keep the alloc and hardware info separate for each IOMMU
architecture, we should try to come up with common invalidation methods.

It matters because things like vSVA, or just efficient dynamic mappings,
will require optimal invalidation latency. A paravirtual interface like
vhost-iommu can help with that, as the host kernel will handle guest
invalidations directly instead of doing a round-trip to host userspace
(and we'll likely want the same path for PRI.)

Supporting HW page tables for a common PV IOMMU does require some
architecture-specific knowledge, but invalidation messages contain roughly
the same information on all architectures. The PV IOMMU won't include
command opcodes for each possible architecture if one generic command does
the same job.

Ideally I'd like something like this for vhost-iommu:

* slow path through userspace for complex requests like attach-table and
  probe, where the VMM can decode arch-specific information and translate
  them to iommufd and vhost-iommu ioctls to update the configuration.

* fast path within the kernel for performance-critical requests like
  invalidate, page request and response. It would be absurd for the
  vhost-iommu driver to translate generic invalidation requests from the
  guest into arch-specific commands with special opcodes, when the next
  step is calling the IOMMU driver which does that for free.

During previous discussions we came up with generic invalidations that
could fit both Arm and x86 [1][2]. The only difference was the ASID
(called archid/id in those proposals) which VT-d didn't need. Could we try
to build on that?

[1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161
[2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html

Thanks,
Jean
  
Jason Gunthorpe March 9, 2023, 2:48 p.m. UTC | #2
On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:

> Although we can keep the alloc and hardware info separate for each IOMMU
> architecture, we should try to come up with common invalidation methods.

The invalidation language is tightly linked to the actual cache block
and cache tag in the IOMMU HW design. Generality will lose or
obfuscate the necessary specificity that is required for creating real
vIOMMUs.

Further, invalidation is a fast path; it is crazy to take a vIOMMU of
a real HW receiving a native invalidation request, mangle it to some
obfuscated kernel version and then de-mangle it again in the kernel
driver. IMHO ideally qemu will simply point the invalidation at the
WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we
should check more into this)

The purpose of these interfaces is to support high performance full
functionality vIOMMUs of the real HW, we should not lose sight of
that goal.

We are actually planning to go further and expose direct invalidation
work queues complete with HW doorbells to userspace. This obviously
must be in native HW format.

Nicolin, I think we should tweak the uAPI here so that the
invalidation opaque data has a format tagged on its own, instead of
re-using the HWPT tag. Ie you can have an ARM SMMUv3 invalidate type
tag and also a virtio-viommu invalidate type tag.

This will allow Jean to put the invalidation decoding in the iommu
drivers if we think that is the right direction. virtio can
standardize it as a "HW format".

> Ideally I'd like something like this for vhost-iommu:
> 
> * slow path through userspace for complex requests like attach-table and
>   probe, where the VMM can decode arch-specific information and translate
>   them to iommufd and vhost-iommu ioctls to update the configuration.
> 
> * fast path within the kernel for performance-critical requests like
>   invalidate, page request and response. It would be absurd for the
>   vhost-iommu driver to translate generic invalidation requests from the
>   guest into arch-specific commands with special opcodes, when the next
>   step is calling the IOMMU driver which does that for free.

Someone has to do the conversion. If you don't think virtio should do
it then I'd be OK to add a type tag for virtio format invalidation and
put it in the IOMMU driver.

But given virtio overall already has to know *a lot* about how the HW
it is wrapping works, I don't think it is necessarily absurd for virtio
to do the conversion. I'd like to evaluate this in patches in context
with how much other unique HW code ends up in kernel-side vhost-iommu.

However, I don't know the rationale for virtio-viommu, it seems like a
strange direction to me. All the iommu drivers have native command
queues. ARM and AMD are both supporting native command queues directly
in the guest, complete with a direct guest MMIO doorbell ring.

If someone wants to optimize this I'd think the way to do it is to use
virtio like techniques to put SW command queue processing in the
kernel iommu driver and continue to use the HW native interface in the
VM.

What benefit comes from replacing the HW native interface with virtio?
Especially on ARM where the native interface is pretty clean?

> During previous discussions we came up with generic invalidations that
> could fit both Arm and x86 [1][2]. The only difference was the ASID
> (called archid/id in those proposals) which VT-d didn't need. Could we try
> to build on that?

IMHO this was just unioning all the different invalidation types
together. It makes sense for something like virtio but it is
illogical/obfuscated as a user/kernel interface since it still
requires a userspace HW driver to understand what subset of the
invalidations are used on the actual HW.

Jason
  
Shameerali Kolothum Thodi March 9, 2023, 3:26 p.m. UTC | #3
> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org]
> 
> Hi Nicolin,
> 
> On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
> > Add the following data structures for corresponding ioctls:
> >                iommu_hwpt_arm_smmuv3 =>
> IOMMUFD_CMD_HWPT_ALLOC
> >     iommu_hwpt_invalidate_arm_smmuv3 =>
> IOMMUFD_CMD_HWPT_INVALIDATE
> >
> > Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and
> IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
> > to the header and corresponding type/size arrays.
> >
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> > +/**
> > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table
> data
> > + *
> > + * @flags: page table entry attributes
> > + * @s2vmid: Virtual machine identifier
> > + * @s1ctxptr: Stage-1 context descriptor pointer
> > + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
> > + * @s1fmt: Stage-1 Format
> > + * @s1dss: Default substream
> > + */
> > +struct iommu_hwpt_arm_smmuv3 {
> > +#define IOMMU_SMMUV3_FLAG_S2	(1 << 0) /* if unset, stage-1 */
> 
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields
> 
> > +#define IOMMU_SMMUV3_FLAG_VMID	(1 << 1) /* vmid override */
> 
> Doesn't this break isolation?  The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device

On platforms that support BTM [1], we may need the VMID allocated by KVM.
But again getting that from user space doesn't look safe. I have attempted to revise
the earlier RFC to pin and use the KVM VMID from SMMUv3 here [2].

But the problem is getting the KVM instance associated with the device. Currently I am
going through the VFIO layer to retrieve the KVM instance (vfio_device->kvm).

On the previous RFC discussion thread [3], Jean proposed:

" In the new design we can require from the start that creating a nesting IOMMU
container through /dev/iommu *must* come with a KVM context, that way
we're sure to reuse the existing VMID. "

Is that something we can still do, or is there a better way to handle this now?

Thanks,
Shameer


1. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m478f7e7d5dcb729e02721beda35efa12c1d20707
2. https://github.com/hisilicon/kernel-dev/commits/iommufd-v6.2-rc4-nesting-arm-btm-v2
3. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m11cde7534943ea7cf35f534cb809a023eabd9da3
  
Jason Gunthorpe March 9, 2023, 3:40 p.m. UTC | #4
On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote:

> On platforms that supports BTM [1], we may need the VMID allocated by KVM.
> But again getting that from user pace doesn't look safe. I have attempted to revise
> the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2].

Gurk

> " In the new design we can require from the start that creating a nesting IOMMU
> container through /dev/iommu *must* come with a KVM context, that way
> we're sure to reuse the existing VMID. "

I've been dreading this but yes I expect we will eventually need to
connect kvm and iommufd together. The iommu driver can receive a kvm
pointer as part of the alloc domain operation to do any setups like
this.

If there is no KVM it should either fail to setup the domain or setup
a domain disconnected from KVM.
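
As a rough sketch of the shape I mean (every name below is invented for
illustration; none of this exists in the series or in KVM):

struct kvm;	/* opaque here; only handed through to an arch helper */

/* Hypothetical helpers, declared so the sketch is self-contained */
int example_iommu_shares_vmid_with_kvm(void);
int example_kvm_get_vmid(struct kvm *kvm);
int example_iommu_vmid_alloc(void);

/*
 * Pick the stage-2 VMID at domain allocation time: reuse the KVM VMID
 * when the HW requires a shared ID space (e.g. BTM), otherwise fall back
 * to the driver's private allocator. Returns a VMID or a negative errno.
 */
static int example_s2_vmid_get(struct kvm *kvm)
{
	if (kvm && example_iommu_shares_vmid_with_kvm())
		return example_kvm_get_vmid(kvm);
	/* no KVM (or no sharing): a domain disconnected from KVM */
	return example_iommu_vmid_alloc();
}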

If IOMMU HW and KVM HW are using the same ID number space then
arguably the two kernel drivers need to use a shared ID allocator in
the arch, regardless of what iommufd/etc does. Using KVM should not be
mandatory for iommufd.

For ARM cases where there is no shared VMID space with KVM, the ARM
VMID should be somehow assigned to the iommufd_ctx itself and the alloc
domain op should receive it from there.

Nicolin, that seems to be missing in this series? I'm not entirely
sure how to elegantly code it :\

Jason
  
Shameerali Kolothum Thodi March 9, 2023, 3:51 p.m. UTC | #5
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> 
> On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > On platforms that supports BTM [1], we may need the VMID allocated by
> KVM.
> > But again getting that from user pace doesn't look safe. I have attempted
> to revise
> > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2].
> 
> Gurk
> 
> > " In the new design we can require from the start that creating a nesting
> IOMMU
> > container through /dev/iommu *must* come with a KVM context, that way
> > we're sure to reuse the existing VMID. "
> 
> I've been dreading this but yes I execpt we will eventually need to
> connect kvm and iommufd together. The iommu driver can receive a kvm
> pointer as part of the alloc domain operation to do any setups like
> this.

That will make life easier :)
 
> If there is no KVM it should either fail to setup the domain or setup
> a domain disconnected from KVM.
> 

If there is no KVM, the SMMUv3 can fall back to its internal VMID allocation, I guess.
And my intention was to use the KVM VMID only if the platform supports
BTM.

> If IOMMU HW and KVM HW are using the same ID number space then
> arguably the two kernel drivers need to use a shared ID allocator in
> the arch, regardless of what iommufd/etc does. Using KVM should not be
> mandatory for iommufd.
> 
> For ARM cases where there is no shared VMID space with KVM, the ARM
> VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> domain op should receive it from there.

Is there any use of the VMID outside SMMUv3? I was thinking that if the nested domain alloc
doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.

Thanks,
Shameer

> Nicolin, that seems to be missing in this series? I'm not entirely
> sure how to elegantly code it :\
> 
> Jason
  
Jason Gunthorpe March 9, 2023, 3:59 p.m. UTC | #6
On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi wrote:

> > For ARM cases where there is no shared VMID space with KVM, the ARM
> > VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> > domain op should receive it from there.
> 
> Is there any use of VMID outside SMMUv3? I was thinking if nested domain alloc
> doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. 

When we talk about exposing an SMMUv3 IOMMU CMDQ directly to userspace, the
VMID is the security token that protects it.

So in that environment every domain under the same iommufd should
share the same VMID so that the CMDQs also share the same VMID.

I expect this to be a common sort of requirement as we will see
userspace command queues in other HW as well.

So, I suppose the answer for now is that ARM SMMUv3 should just
allocate one VMID per iommu_domain and there should be no VMID in the
uapi at all.

Moving all iommu_domains to share the same VMID is a future patch.

Though.. I have no idea how vVMID is handled in the SMMUv3
architecture. I suppose the guest IOMMU HW caps are set in a way that
it knows it does not have VMID?

Jason
  
Shameerali Kolothum Thodi March 9, 2023, 4:07 p.m. UTC | #7
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> 
> On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> > > domain op should receive it from there.
> >
> > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> alloc
> > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
> 
> When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> userspace then
> VMID is the security token that protects it.
> 
> So in that environment every domain under the same iommufd should
> share the same VMID so that the CMDQ's also share the same VMID.
> 
> I expect this to be a common sort of requirement as we will see
> userspace command queues in the other HW as well.
> 
> So, I suppose the answer for now is that ARM SMMUv3 should just
> allocate one VMID per iommu_domain and there should be no VMID in the
> uapi at all.
> 
> Moving all iommu_domains to share the same VMID is a future patch.
> 
> Though.. I have no idea how vVMID is handled in the SMMUv3
> architecture. I suppose the guest IOMMU HW caps are set in a way that
> it knows it does not have VMID?

I think the guest only sets up the SMMUv3 S1 stage and it doesn't use a VMID.

Thanks,
Shameer
  
Jean-Philippe Brucker March 9, 2023, 6:26 p.m. UTC | #8
On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
> 
> > Although we can keep the alloc and hardware info separate for each IOMMU
> > architecture, we should try to come up with common invalidation methods.
> 
> The invalidation language is tightly linked to the actual cache block
> and cache tag in the IOMMU HW design.

Concretely though, what are the incompatibilities between the HW designs?
They all need to remove a range of TLB entries, using some address space
tag. But if there is an actual difference I do need to know.

> Generality will loose or
> obfuscate the necessary specificity that is required for creating real
> vIOMMUs.
> 
> Further, invalidation is a fast path, it is crazy to take a vIOMMU of
> a real HW receving a native invalidation request, mangle it to some
> obfuscated kernel version and then de-mangle it again in the kernel
> driver. IMHO ideally qemu will simply point the invalidation at the
> WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we
> should check more into this)

Avoiding copying a few bytes won't make up for the extra context switches
to userspace. An emulated IOMMU can easily decode commands and translate
them to generic kernel structures, in a handful of CPU cycles, just like
they decode STEs. It's what they do, and it's the opposite of obfuscation.
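
For example, decoding one native SMMUv3 CMD_TLBI_NH_VA into a generic
range-invalidation record is only a few field extractions. This is a
sketch: the bit positions follow my reading of the SMMUv3 command layout,
and the generic struct is made up:

#include <stdint.h>

#define CMD_OPCODE(c0)		((c0) & 0xffULL)		/* cmd[0] bits 7:0 */
#define CMD_TLBI_ASID(c0)	(((c0) >> 48) & 0xffffULL)	/* cmd[0] bits 63:48 */
#define CMD_TLBI_VA(c1)		((c1) & ~0xfffULL)		/* cmd[1] bits 63:12 */
#define CMD_TLBI_LEAF(c1)	((c1) & 0x1ULL)			/* cmd[1] bit 0 */
#define OP_TLBI_NH_VA		0x12

struct generic_inv {		/* made-up generic record */
	uint64_t iova;
	uint32_t asid;
	int leaf;
};

/* Fill @inv from one 16-byte command; only the one opcode handled here */
static int decode_tlbi_nh_va(const uint64_t cmd[2], struct generic_inv *inv)
{
	if (CMD_OPCODE(cmd[0]) != OP_TLBI_NH_VA)
		return -1;
	inv->asid = (uint32_t)CMD_TLBI_ASID(cmd[0]);
	inv->iova = CMD_TLBI_VA(cmd[1]);
	inv->leaf = (int)CMD_TLBI_LEAF(cmd[1]);
	return 0;
}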

> 
> The purpose of these interfaces is to support high performance full
> functionality vIOMMUs of the real HW, we should not loose sight of
> that goal.
> 
> We are actually planning to go futher and expose direct invalidation
> work queues complete with HW doorbells to userspace. This obviously
> must be in native HW format.

Doesn't seem relevant since direct access to command queue wouldn't use
this struct.

> 
> Nicolin, I think we should tweak the uAPI here so that the
> invalidation opaque data has a format tagged on its own, instead of
> re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> tag and also a virtio-viommu invalidate type tag.
> 
> This will allow Jean to put the invalidation decoding in the iommu
> drivers if we think that is the right direction. virtio can
> standardize it as a "HW format".
> 
> > Ideally I'd like something like this for vhost-iommu:
> > 
> > * slow path through userspace for complex requests like attach-table and
> >   probe, where the VMM can decode arch-specific information and translate
> >   them to iommufd and vhost-iommu ioctls to update the configuration.
> > 
> > * fast path within the kernel for performance-critical requests like
> >   invalidate, page request and response. It would be absurd for the
> >   vhost-iommu driver to translate generic invalidation requests from the
> >   guest into arch-specific commands with special opcodes, when the next
> >   step is calling the IOMMU driver which does that for free.
> 
> Someone has to do the conversion. If you don't think virito should do
> it then I'd be OK to add a type tag for virtio format invalidation and
> put it in the IOMMU driver.

Implementing two invalidation formats in each IOMMU driver does not seem
practical.

> 
> But given virtio overall already has to know *alot* about how the HW
> it is wrapping works I don't think it is necessarily absurd for virtio
> to do the conversion. I'd like to evaluate this in patches in context
> with how much other unique HW code ends up in kernel-side vhost-iommu.

Ideally none. I'd rather leave those, attach and probe, in userspace and
if possible compatible with iommufd to avoid register decoding. 

> 
> However, I don't know the rational for virtio-viommu, it seems like a
> strange direction to me.

A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
all vendor IOMMUs, new architectures get vIOMMU mostly for free, and vhost
provides a faster path. Also the ability to optimize paravirtual
interfaces for things like combined invalidation (IOTLB+ATC) or, later,
nested page requests.

For a while the main vIOMMU use-case was assignment to guest userspace,
mainly DPDK, which works great with a generic and slow map/unmap
interface. Since vSVA is still a niche use-case, and nesting without page
faults requires pinning the whole guest memory, map/unmap still seems more
desirable to me. But there is some renewed interest in supporting page
tables with virtio-iommu for the reasons above.

> All the iommu drivers have native command
> queues. ARM and AMD are both supporting native command queues directly
> in the guest, complete with a direct guest MMIO doorbell ring.

Arm SMMUv3 mandates a single global command queue (SMMUv2 uses registers).
An SMMUv3 can optionally implement multiple command queues, though I don't
know if they can be safely assigned to guests. For a lot of SMMUv3
implementations that have a single queue and for other architectures, we
can do better than hardware emulation.

> 
> If someone wants to optimize this I'd think the way to do it is to use
> virtio like techniques to put SW command queue processing in the
> kernel iommu driver and continue to use the HW native interface in the
> VM.

I didn't get which kernel this is.

> 
> What benifit comes from replacing the HW native interface with virtio?
> Especially on ARM where the native interface is pretty clean?
> 
> > During previous discussions we came up with generic invalidations that
> > could fit both Arm and x86 [1][2]. The only difference was the ASID
> > (called archid/id in those proposals) which VT-d didn't need. Could we try
> > to build on that?
> 
> IMHO this was just unioning all the different invalidation types
> together. It makes sense for something like virtio but it is
> illogical/obfuscated as a user/kernel interface since it still
> requires a userspace HW driver to understand what subset of the
> invalidations are used on the actual HW.

As above, decoding arch-specific structures into generic ones is what an
emulated IOMMU does, and it doesn't make a performance difference in which
format it forwards that to the kernel. The host IOMMU driver checks the
guest request and copies them into the command queue. Whether that request
comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
some generic structure, does not make a difference.

Thanks,
Jean
  
Jason Gunthorpe March 9, 2023, 9:01 p.m. UTC | #9
On Thu, Mar 09, 2023 at 06:26:59PM +0000, Jean-Philippe Brucker wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
> > 
> > > Although we can keep the alloc and hardware info separate for each IOMMU
> > > architecture, we should try to come up with common invalidation methods.
> > 
> > The invalidation language is tightly linked to the actual cache block
> > and cache tag in the IOMMU HW design.
> 
> Concretely though, what are the incompatibilities between the HW designs?
> They all need to remove a range of TLB entries, using some address space
> tag. But if there is an actual difference I do need to know.

For instance, the address space tags and the cache entries they match
to are wildly different.

ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
and incorporates the PASID into the cache tag.

AMD uses something called a DID that covers a different set of stuff
than the Intel DID, and it doesn't seem to work for nesting. AMD uses
PASID as the primary nested cache tag.

Superficially you can say all three have an ASID and you can have an
"invalidate ASID" operation and make it "look" the same, but the actual
behavior is totally ill defined and the whole thing is utterly
obfuscated as to what it actually MEANS.

But this doesn't matter for virtio. You have already got a spec that
defines invalidation in terms of virtio objects that sort of match
things like iommu_domains. I hope the virtio
VIRTIO_IOMMU_INVAL_S_DOMAIN is very well defined as to exactly what
objects a DOMAIN match applies to. The job of the hypervisor is to
translate that to whatever invalidation(s) the real HW requires.

ie virtio is going to say "invalidate this domain" and expect the
hypervisor to spew it to every attached PASID if that is what the HW
requires (eg AMD), or do a single ASID invalidation (Intel, sometimes)
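
Sketched out, that fan-out is roughly the following (all names are
invented; it only shows the shape of the hypervisor-side translation):

#include <stddef.h>
#include <stdint.h>

struct example_domain {
	int per_pasid_cache_tags;	/* e.g. AMD-style: PASID is the tag */
	uint32_t id;			/* e.g. Intel-style DID */
	uint32_t pasids[16];		/* attached PASIDs, illustrative */
	size_t nr_pasids;
};

/* Hypothetical HW-specific invalidation ops */
void example_inv_by_id(uint32_t id);
void example_inv_by_pasid(uint32_t id, uint32_t pasid);

/* One generic "invalidate this domain" request from the guest ... */
static void example_invalidate_domain(struct example_domain *d)
{
	if (!d->per_pasid_cache_tags) {
		example_inv_by_id(d->id);	/* ... is a single invalidation here */
		return;
	}
	for (size_t i = 0; i < d->nr_pasids; i++)	/* ... or a spew per PASID here */
		example_inv_by_pasid(d->id, d->pasids[i]);
}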

But when a vIOMMU gets a vDID or vPASID invalidation command it
doesn't mean the same thing as virtio. The driver must invalidate
exactly what the vIOMMU programming model says to invalidate because
the guest is going to spew more invalidations to cover what it
needs. Over invalidation would be a performance problem.

Exposing these subtle differences to userspace is necessary. What I do
not want is leaking those differences through an ill-defined "generic"
interface.

Even more importantly Intel and ARM should not have to fight about the
subtle implementation specific details of the specification of the
"generic" interface. If Intel needs something dumb to make their
viommu work well then they should simply be able to do it. I don't
want to spend 6 months of pointless arguing about language details in
an unnecessary "generic" spec.

> > The purpose of these interfaces is to support high performance full
> > functionality vIOMMUs of the real HW, we should not loose sight of
> > that goal.
> > 
> > We are actually planning to go futher and expose direct invalidation
> > work queues complete with HW doorbells to userspace. This obviously
> > must be in native HW format.
> 
> Doesn't seem relevant since direct access to command queue wouldn't use
> this struct.

The point is our design direction with iommufd is to expose the raw HW
to userspace, not to obfuscate it with ill defined generalizations.

> > Someone has to do the conversion. If you don't think virito should do
> > it then I'd be OK to add a type tag for virtio format invalidation and
> > put it in the IOMMU driver.
> 
> Implementing two invalidation formats in each IOMMU driver does not seem
> practical.

I don't see why.

The advantage of the kernel side is that the implementation is not
strong ABI. If we want to adjust and review the virtio invalidation
path as new HW comes along we can, so long as it is only in the
kernel.

On the other hand if we mess up the uABI for iommufd we are stuck with
it.

The safest and best uABI for iommufd is the HW native uABI because it,
almost by definition, cannot be wrong.

Anyhow, I'm still not very convinced that adapting to the virtio invalidation
format should be in the iommu drivers. I think what you end up with for
virtio is that Intel/AMD have some nice common code to invalidate an
iommu_domain address range (probably even the existing invalidation
interface), and SMMUv3 is just totally different and special.

This is because SMMUv3 has no option to keep the PASID table in the
hypervisor so you are sadly forced to expose both the native ASID and
native PASID caches to the virtio protocol.

Given that the VM virtio driver has to have SMMUv3 specific code to
handle the CD table it must get, I don't see the problem with also
having SMMUv3 specific code in the hypervisor virtio driver to handle
invalidating based on the CD table.

Really, I want to see patches implementing all of this before we make
any decision on what the ops interface is for virtio-iommu kernel
side.

> > However, I don't know the rational for virtio-viommu, it seems like a
> > strange direction to me.
> 
> A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> all vendor IOMMUs, new architectures get vIOMMU mostly for free,

So your argument is you can implement a simple map/unmap API riding
on the common IOMMU API and this is portable?

Seems sensible, but that falls apart pretty quickly when we talk about
nesting.. I don't think we can avoid VMM components to set this up, so
it stops being portable. At that point I'm back to asking why not use
the real HW model?

> > All the iommu drivers have native command
> > queues. ARM and AMD are both supporting native command queues directly
> > in the guest, complete with a direct guest MMIO doorbell ring.
> 
> Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> registers). An SMMUv3 can optionally implement multiple command
> queues, though I don't know if they can be safely assigned to
> guests.

It is not standardized by ARM, but it can be (and has been) done.

> For a lot of SMMUv3 implementations that have a single queue and for
> other architectures, we can do better than hardware emulation.

How is using a SW emulated virtio formatted queue better than using a
SW emulated SMMUv3 ECMDQ?

The vSMMUv3 driver controls what capabilities are shown to the guest; it
can definitely create an ECMDQ-enabled device and do something like
assign the secondary ECMDQs to hypervisor kernel SW queues, the same way
virtio does.

I don't think there is a very solid argument that virtio-iommu is
necessary to get faster invalidation.

> > If someone wants to optimize this I'd think the way to do it is to use
> > virtio like techniques to put SW command queue processing in the
> > kernel iommu driver and continue to use the HW native interface in the
> > VM.
> 
> I didn't get which kernel this is.

hypervisor kernel.

> > IMHO this was just unioning all the different invalidation types
> > together. It makes sense for something like virtio but it is
> > illogical/obfuscated as a user/kernel interface since it still
> > requires a userspace HW driver to understand what subset of the
> > invalidations are used on the actual HW.
> 
> As above, decoding arch-specific structures into generic ones is what an
> emulated IOMMU does,

No, it is what virtio wants to do. We are deliberately trying not to
do that for real accelerated HW vIOMMU emulators.

> and it doesn't make a performance difference in which
> format it forwards that to the kernel. The host IOMMU driver checks the
> guest request and copies them into the command queue. Whether that request
> comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> some generic structure, does not make a difference.

It is not the structure layouts that matter!

It is the semantic meaning of each request, on each unique piece of
hardware. We actually want to leak the subtle semantic differences to
userspace.

Doing that and continuing to give them the same command label is
exactly the kind of obfuscated ill defined "generic" interface I do
not want.

Jason
  
Nicolin Chen March 10, 2023, 4:50 a.m. UTC | #10
On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:

> Nicolin, I think we should tweak the uAPI here so that the
> invalidation opaque data has a format tagged on its own, instead of
> re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> tag and also a virtio-viommu invalidate type tag.

The invalidation tag is shared with the hwpt allocation. Does
it mean that virtio-iommu won't have its own allocation tag?

Thanks
Nic
  
Nicolin Chen March 10, 2023, 5:04 a.m. UTC | #11
Hi Jean,

Allow me to partially reply to your email:

On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:

> > +struct iommu_hwpt_arm_smmuv3 {
> > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
> 
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields

I should have probably put more description for this flag. It
is used to allocate a stage-2 domain for a nested translation
setup. The default allocation for a kernel-managed domain will
allocate an S1 format of IO page table, at ARM_SMMU_DOMAIN_S1
stage. But a nested kernel-managed domain needs an S2 format,
at ARM_SMMU_DOMAIN_S2.
	
So the whole structure seems to only provide stage-1 info, but
it's used for both stages. And a stage-2 allocation will only
need s2vmid if the VMID flag is set (explained below).
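
A trimmed-down sketch of the decoding I have in mind (the two flag values
are the ones from the patch; everything else is a simplified stand-in
rather than the actual driver code):

#include <linux/types.h>

#define IOMMU_SMMUV3_FLAG_S2	(1 << 0)
#define IOMMU_SMMUV3_FLAG_VMID	(1 << 1)

enum example_stage { EXAMPLE_DOMAIN_S1, EXAMPLE_DOMAIN_S2 };

struct example_cfg {
	enum example_stage stage;
	__u32 vmid;		/* stage-2 only */
	__u64 s1ctxptr;		/* guest CD table pointer, stage-1 only */
};

static void example_parse_hwpt(__u64 flags, __u32 s2vmid, __u64 s1ctxptr,
			       __u32 driver_vmid, struct example_cfg *cfg)
{
	if (flags & IOMMU_SMMUV3_FLAG_S2) {
		/* nesting parent: S2 format, ARM_SMMU_DOMAIN_S2 in the driver */
		cfg->stage = EXAMPLE_DOMAIN_S2;
		cfg->vmid = (flags & IOMMU_SMMUV3_FLAG_VMID) ?
			    s2vmid : driver_vmid;
	} else {
		/* default: S1 format, ARM_SMMU_DOMAIN_S1 in the driver */
		cfg->stage = EXAMPLE_DOMAIN_S1;
		cfg->s1ctxptr = s1ctxptr;
	}
}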

> > +#define IOMMU_SMMUV3_FLAG_VMID       (1 << 1) /* vmid override */
> 
> Doesn't this break isolation?  The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device

This flag isn't mature yet. I kept it from my internal RFC to
see if we can have a better solution. There are use cases on
certain platforms where the VMIDs across all devices in the
same VM need to be aligned.

Thanks
Nic
  
Nicolin Chen March 10, 2023, 5:18 a.m. UTC | #12
On Thu, Mar 09, 2023 at 11:40:16AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote:
> 
> > On platforms that supports BTM [1], we may need the VMID allocated by KVM.
> > But again getting that from user pace doesn't look safe. I have attempted to revise
> > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2].
> 
> Gurk
> 
> > " In the new design we can require from the start that creating a nesting IOMMU
> > container through /dev/iommu *must* come with a KVM context, that way
> > we're sure to reuse the existing VMID. "
> 
> I've been dreading this but yes I execpt we will eventually need to
> connect kvm and iommufd together. The iommu driver can receive a kvm
> pointer as part of the alloc domain operation to do any setups like
> this.
> 
> If there is no KVM it should either fail to setup the domain or setup
> a domain disconnected from KVM.
> 
> If IOMMU HW and KVM HW are using the same ID number space then
> arguably the two kernel drivers need to use a shared ID allocator in
> the arch, regardless of what iommufd/etc does. Using KVM should not be
> mandatory for iommufd.
> 
> For ARM cases where there is no shared VMID space with KVM, the ARM
> VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> domain op should receive it from there.
> 
> Nicolin, that seems to be missing in this series? I'm not entirely
> sure how to elegantly code it :\

Yea, it's missing. The VMID thing is supposed to be a sneak peek
of my next VCMDQ solution. Now it seems that BTM needs this too.

Remember that my previous VCMDQ series had a big complication to
share VMID across the passthrough devices in the same VM? During
that patch review, we concluded that IOMMUFD would simply align
VMIDs using a unified ctx ID or so, IIRC.

Thanks
Nic
  
Nicolin Chen March 10, 2023, 5:26 a.m. UTC | #13
On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote:
> 
> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> >
> > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> > wrote:
> >
> > > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> > > > domain op should receive it from there.
> > >
> > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> > alloc
> > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
> >
> > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> > userspace then
> > VMID is the security token that protects it.
> >
> > So in that environment every domain under the same iommufd should
> > share the same VMID so that the CMDQ's also share the same VMID.
> >
> > I expect this to be a common sort of requirement as we will see
> > userspace command queues in the other HW as well.
> >
> > So, I suppose the answer for now is that ARM SMMUv3 should just
> > allocate one VMID per iommu_domain and there should be no VMID in the
> > uapi at all.
> >
> > Moving all iommu_domains to share the same VMID is a future patch.
> >
> > Though.. I have no idea how vVMID is handled in the SMMUv3
> > architecture. I suppose the guest IOMMU HW caps are set in a way that
> > it knows it does not have VMID?
> 
> I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID.

Yea, a vmid is only allocated in an S2 domain allocation. So,
a guest allocating only S1 domains always sets VMID=0. Yet, I
think that the hypervisor or somewhere in the host kernel should
replace the VMID=0 with a unified VMID.

Thanks
Nic
  
Nicolin Chen March 10, 2023, 5:36 a.m. UTC | #14
On Thu, Mar 09, 2023 at 09:26:57PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote:
> > 
> > > -----Original Message-----
> > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > >
> > > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi
> > > wrote:
> > >
> > > > > For ARM cases where there is no shared VMID space with KVM, the ARM
> > > > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc
> > > > > domain op should receive it from there.
> > > >
> > > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain
> > > alloc
> > > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID.
> > >
> > > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to
> > > userspace then
> > > VMID is the security token that protects it.
> > >
> > > So in that environment every domain under the same iommufd should
> > > share the same VMID so that the CMDQ's also share the same VMID.
> > >
> > > I expect this to be a common sort of requirement as we will see
> > > userspace command queues in the other HW as well.
> > >
> > > So, I suppose the answer for now is that ARM SMMUv3 should just
> > > allocate one VMID per iommu_domain and there should be no VMID in the
> > > uapi at all.
> > >
> > > Moving all iommu_domains to share the same VMID is a future patch.
> > >
> > > Though.. I have no idea how vVMID is handled in the SMMUv3
> > > architecture. I suppose the guest IOMMU HW caps are set in a way that
> > > it knows it does not have VMID?
> > 
> > I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID.
> 
> Yea, a vmid is only allocated in an S2 domain allocation. So,
> a guest allocating only S1 domains always sets VMID=0. Yet, I
> think that the hypervisor or some where in host kernel should
> replace the VMID=0 with a unified VMID.

Ah, I just recall a conversation with Jason that a VM should only
have one S2 domain. In that case, the VMID is already unified?

Thanks
Nic
  
Eric Auger March 10, 2023, 11:33 a.m. UTC | #15
Hi,
On 3/9/23 14:42, Jean-Philippe Brucker wrote:
> Hi Nicolin,
>
> On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote:
>> Add the following data structures for corresponding ioctls:
>>                iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC
>>     iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE
>>
>> Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1
>> to the header and corresponding type/size arrays.
>>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> +/**
>> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
>> + *
>> + * @flags: page table entry attributes
>> + * @s2vmid: Virtual machine identifier
>> + * @s1ctxptr: Stage-1 context descriptor pointer
>> + * @s1cdmax: Number of CDs pointed to by s1ContextPtr
>> + * @s1fmt: Stage-1 Format
>> + * @s1dss: Default substream
>> + */
>> +struct iommu_hwpt_arm_smmuv3 {
>> +#define IOMMU_SMMUV3_FLAG_S2	(1 << 0) /* if unset, stage-1 */
> I don't understand the purpose of this flag, since the structure only
> provides stage-1 configuration fields
>
>> +#define IOMMU_SMMUV3_FLAG_VMID	(1 << 1) /* vmid override */
> Doesn't this break isolation?  The VMID space is global for the SMMU, so
> the guest could access cached mappings of another device
>
>> +	__u64 flags;
>> +	__u32 s2vmid;
>> +	__u32 __reserved;
>> +	__u64 s1ctxptr;
>> +	__u64 s1cdmax;
>> +	__u64 s1fmt;
>> +	__u64 s1dss;
>> +};
>> +
>
>> +/**
>> + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
>> + * @flags: boolean attributes of cache invalidation command
>> + * @opcode: opcode of cache invalidation command
>> + * @ssid: SubStream ID
>> + * @granule_size: page/block size of the mapping in bytes
>> + * @range: IOVA range to invalidate
>> + */
>> +struct iommu_hwpt_invalidate_arm_smmuv3 {
>> +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF	(1 << 0)
>> +	__u64 flags;
>> +	__u8 opcode;
>> +	__u8 padding[3];
>> +	__u32 asid;
>> +	__u32 ssid;
>> +	__u32 granule_size;
>> +	struct iommu_iova_range range;
>> +};
> Although we can keep the alloc and hardware info separate for each IOMMU
> architecture, we should try to come up with common invalidation methods.
>
> It matters because things like vSVA, or just efficient dynamic mappings,
> will require optimal invalidation latency. A paravirtual interface like
> vhost-iommu can help with that, as the host kernel will handle guest
> invalidations directly instead of doing a round-trip to host userspace
> (and we'll likely want the same path for PRI.)
>
> Supporting HW page tables for a common PV IOMMU does require some
> architecture-specific knowledge, but invalidation messages contain roughly
> the same information on all architectures. The PV IOMMU won't include
> command opcodes for each possible architecture if one generic command does
> the same job.
>
> Ideally I'd like something like this for vhost-iommu:
>
> * slow path through userspace for complex requests like attach-table and
>   probe, where the VMM can decode arch-specific information and translate
>   them to iommufd and vhost-iommu ioctls to update the configuration.
>
> * fast path within the kernel for performance-critical requests like
>   invalidate, page request and response. It would be absurd for the
>   vhost-iommu driver to translate generic invalidation requests from the
>   guest into arch-specific commands with special opcodes, when the next
>   step is calling the IOMMU driver which does that for free.
>
> During previous discussions we came up with generic invalidations that
> could fit both Arm and x86 [1][2]. The only difference was the ASID
> (called archid/id in those proposals) which VT-d didn't need. Could we try
> to build on that?

I do agree with Jean. We spent a lot of effort all together to define
this generic invalidation API, and if there is no compelling reason that
prevents us from using it, we should try to reuse it.

Thanks

Eric
>
> [1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161
> [2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html
>
> Thanks,
> Jean
>
  
Jean-Philippe Brucker March 10, 2023, 12:16 p.m. UTC | #16
On Thu, Mar 09, 2023 at 05:01:15PM -0400, Jason Gunthorpe wrote:
> > Concretely though, what are the incompatibilities between the HW designs?
> > They all need to remove a range of TLB entries, using some address space
> > tag. But if there is an actual difference I do need to know.
> 
> For instance the address space tags and the cache entires they match
> to are wildly different.
> 
> ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
> and incorporates the PASID into the cache tag.
> 
> AMD uses something called a DID that covers a different set of stuff
> than the Intel DID, and it doesn't seem to work for nesting. AMD uses
> PASID as the primary nested cache tag.

Thanks, we'll look into that


> This is because SMMUv3 has no option to keep the PASID table in the
> hypervisor so you are sadly forced to expose both the native ASID and
> native PASID caches to the virtio protocol.

It is possible to keep the PASID table in the host, but you need a way to
allocate GPA since the SMMU accesses it after stage-2 translation. I think
that necessarily requires a PV interface, but you could look into it.
Anyway, even with that, ATC invalidations take a PASID.

> 
> Given that the VM virtio driver has to have SMMUv3 specific code to
> handle the CD table it must get, I don't see the problem with also
> having SMMUv3 specific code in the hypervisor virtio driver to handle
> invalidating based on the CD table.

There isn't much we can't do, I'm just hoping to build something
straightforward instead of having to work around awkward interfaces


> > A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> > all vendor IOMMUs, new architectures get vIOMMU mostly for free,
> 
> So your argument is you can implement a simple map/unmap API riding
> on the common IOMMU API and this is portable?
> 
> Seems sensible, but that falls apart pretty quickly when we talk about
> nesting.. I don't think we can avoid VMM components to set this up, so
> it stops being portable. At that point I'm back to asking why not use
> the real HW model?

A single VMM component that shovels data from the virtqueue to the kernel
API and back, rather than four different hardware emulations, four
different queues, four different device tables. It's obviously better for
VMMs that don't do full-system emulation like QEMU, especially as they
generally already implement a virtio transport. Smaller attack surface,
fewer bugs.

The VMM developer gets a multi-platform vIOMMU without having to study all
the different architecture manuals. There is a small amount of HW specific
data in there, but it only relates to table formats. 

Ideally it wouldn't need any HW knowledge, but that would requires the
APIs to be aligned: instead of ID registers we pass plain features, and
invalidations don't require HW specific opcodes. Otherwise there is going
to be a layer of glue everywhere, which is what I'm trying to avoid here.

> 
> > > All the iommu drivers have native command
> > > queues. ARM and AMD are both supporting native command queues directly
> > > in the guest, complete with a direct guest MMIO doorbell ring.
> > 
> > Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> > registers). An SMMUv3 can optionally implement multiple command
> > queues, though I don't know if they can be safely assigned to
> > guests.
> 
> It is not standardized by ARM, but it can (and has) been done.
> 
> > For a lot of SMMUv3 implementations that have a single queue and for
> > other architectures, we can do better than hardware emulation.
> 
> How is using a SW emulated virtio formatted queue better than using a
> SW emulated SMMUv3 ECMDQ?

We don't need to repeat it for all IOMMU architectures, nor emulate a new
queue in the kernel. The first motivator for virtio-iommu was to avoid
emulating hardware in the kernel. The SMMU maintainer saw how painful that
was to do for the GIC, saw that there is a virtualization queue readily
available in vhost and, well, it just made sense. Still does.


> > As above, decoding arch-specific structures into generic ones is what an
> > emulated IOMMU does,
> 
> No, it is what virtio wants to do. We are deliberately trying not to
> do that for real accelerated HW vIOMMU emulators.

Yes there is a line somewhere, and I'd prefer it to be the page table.
Given how many possible hardware combinations exist and how many more will
show up, it would be good to abstract things where possible.

> 
> > and it doesn't make a performance difference in which
> > format it forwards that to the kernel. The host IOMMU driver checks the
> > guest request and copies them into the command queue. Whether that request
> > comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> > some generic structure, does not make a difference.
> 
> It is not the structure layouts that matter!
> 
> It is the semantic meaning of each request, on each unique piece of
> hardware. We actually want to leak the subtle semantic differences to
> userspace.

These are hardware emulations, of course they have to know about hardware
semantics. The QEMU IOMMUs can work in TCG mode where they decode and
handle everything themselves.

Thanks,
Jean
  
Jason Gunthorpe March 10, 2023, 12:51 p.m. UTC | #17
On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote:

> I do agree with Jean. We spent a lot of effort all together to define
> this generic invalidation API, and if there is no compelling reason that
> prevents us from using it, we should try to reuse it.

That's the compelling reason in a nutshell right there.

A lot of time was invested to create something that might be
general. We still don't know if it is well defined and general. Even
more time is going to be required on it before it could go forward. In
future, more time will be needed for every new HW to try and fit
into it. We don't even know if it will scale to future HW. Nobody has
even checked what today's POWER and S390 HW need.

vs, this stuff was made in a few days. We know it is correct as a uAPI
since it mirrors the HW and we know it is scalable to different HW
schemes if they come up.

So I don't see a good reason to take a risk on a "general" uAPI. If we
make this wrong it could seriously damage the main goal of iommufd -
to build accelerated vIOMMU models.

Especially since the motivating reason in this thread - use it for
virtio-iommu - doesn't even want to use it as a uAPI!

If we get a vhost-virtio then we can decide what to do in-kernel, and
maybe this general API returns as an in-kernel API; I don't know, we
need to see what this thing ends up looking like.

Jason
  
Jean-Philippe Brucker March 10, 2023, 12:54 p.m. UTC | #18
On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> 
> > Nicolin, I think we should tweak the uAPI here so that the
> > invalidation opaque data has a format tagged on its own, instead of
> > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > tag and also a virtio-viommu invalidate type tag.
> 
> The invalidation tage is shared with the hwpt allocation. Does
> it mean that virtio-iommu won't have it's own allocation tag?

I'm not entirely sure what you mean by allocation tag. For example with
SMMU, when attaching page tables (SMMUv2), the guest passes an ASID at
allocation, and when it modifies that address space it passes the same
ASID for invalidation. When attaching PASID tables (SMMUv3), it writes the
ASID/PASID in the PASID table, and passes both in the invalidation.

Note that none of this is set in stone. It copies the Linux API we
originally discussed, but we were waiting for progress on that front
before committing to anything. Now we'll probably align to the new API
where possible, leaving out what doesn't work for virtio-iommu.

Thanks,
Jean
  
Jason Gunthorpe March 10, 2023, 12:55 p.m. UTC | #19
On Thu, Mar 09, 2023 at 09:36:18PM -0800, Nicolin Chen wrote:
> > Yea, a vmid is only allocated in an S2 domain allocation. So,
> > a guest allocating only S1 domains always sets VMID=0. Yet, I
> > think that the hypervisor or some where in host kernel should
> > replace the VMID=0 with a unified VMID.
> 
> Ah, I just recall a conversation with Jason that a VM should only
> have one S2 domain. In that case, the VMID is already unified?

Not required per se, but yes, most likely qemu would run that way.

But you can't just re-use the VMID however you like. AFAIK the VMID is
the cache tag for the S2 IOPTEs, so everything using a given VMID must
refer to the same S2 translation.

You can't mix different S2's with the same VMID.

Thus you are stuck with the single S2 model in qemu if you want to use
a userspace CMDQ.

I suppose that suggests that if KVM supplies the VMID then it is
assigned to a singular S2 iommu_domain also.

Jason
  
Jason Gunthorpe March 10, 2023, 2 p.m. UTC | #20
On Fri, Mar 10, 2023 at 12:54:53PM +0000, Jean-Philippe Brucker wrote:
> On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> > 
> > > Nicolin, I think we should tweak the uAPI here so that the
> > > invalidation opaque data has a format tagged on its own, instead of
> > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > > tag and also a virtio-viommu invalidate type tag.
> > 
> > The invalidation tage is shared with the hwpt allocation. Does
> > it mean that virtio-iommu won't have it's own allocation tag?
> 
> I'm not entirely sure what you mean by allocation tag. 

He means the tag identifying the allocation driver specific data is
the same tag that is passed in to identify the invalidation driver
specific data.

With the notion that the allocation data and invalidation data would
be in the same driver's format.

> Note that none of this is set in stone. It copies the Linux API we
> originally discussed, but we were waiting for progress on that front
> before committing to anything. Now we'll probably align to the new API
> where possible, leaving out what doesn't work for virtio-iommu.

IMHO virtio-iommu should stand alone and make sense with its own
internal object model.

eg I would probably try not to have guests invalidate PASID. Have a
strong ASID model and in most cases have the hypervisor track where
the ASIDs are mapped to PASID/etc and rely on the hypervisor to spew
the invalidations to PASID as required. It is more abstracted from the
actual HW for the guest. The guest can simply say it changed an IOPTE
under a certain ASID.

The ugly wrinkle is SMMUv3 but perhaps your idea of allowing the
hypervisor to manage the CD table in guest memory is reasonable.

IMHO it is a missing SMMUv3 HW feature that the CD table doesn't have
the option to be in hypervisor memory. AMD allows both options, so
I'm not sure I would invest a huge amount in making special cases to
support this... Assume an SMMUv3 update might gain the option someday.

Jason
  
Robin Murphy March 10, 2023, 2:52 p.m. UTC | #21
On 2023-03-09 21:01, Jason Gunthorpe wrote:
>> For a lot of SMMUv3 implementations that have a single queue and for
>> other architectures, we can do better than hardware emulation.
> 
> How is using a SW emulated virtio formatted queue better than using a
> SW emulated SMMUv3 ECMDQ?

Since it's not been said, the really big thing is that virtio explicitly 
informs the host whenever the guest maps something. Emulating SMMUv3 
means the host has to chase all the pagetable pointers in guest memory 
and trap writes such that it has visibility of invalid->valid 
transitions and can update the physical shadow pagetable correspondingly.
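
To make the contrast concrete, a rough sketch of the shadowing that full
emulation implies; all helper names here (pte_valid, iova_of, pa_of,
shadow_map) are made up:

#include <stdint.h>

uint64_t oldval;

/* Rough shape of the shadow update on a trapped guest PTE write. */
static void on_trapped_guest_pte_write(uint64_t *guest_pte, uint64_t newval)
{
	oldval = *guest_pte;

	*guest_pte = newval;
	/* Only invalid->valid transitions are visible this way, hence the
	 * need to trap writes to the guest pagetables at all. */
	if (!pte_valid(oldval) && pte_valid(newval))
		shadow_map(iova_of(guest_pte), pa_of(newval));
}

/* vs. virtio-iommu, where the guest sends an explicit request along the
 * lines of VIRTIO_IOMMU_T_MAP { domain, virt_start, virt_end, phys_start }
 * and the host never has to guess. */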

FWIW we spent quite some time on and off discussing something like 
VT-d's "caching mode", but never found a convincing argument that it was 
a gap which needed filling, since we already had hardware nesting for 
maximum performance and a paravirtualisation option for efficient 
emulation. Thus full SMMUv3 emulation seems to just sit at the bottom as 
the maximum-compatibility option for pushing an unmodified legacy 
bare-metal software stack into a VM where nesting isn't available.

Cheers,
Robin.
  
Jason Gunthorpe March 10, 2023, 3:25 p.m. UTC | #22
On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote:
> On 2023-03-09 21:01, Jason Gunthorpe wrote:
> > > For a lot of SMMUv3 implementations that have a single queue and for
> > > other architectures, we can do better than hardware emulation.
> > 
> > How is using a SW emulated virtio formatted queue better than using a
> > SW emulated SMMUv3 ECMDQ?
> 
> Since it's not been said, the really big thing is that virtio explicitly
> informs the host whenever the guest maps something. Emulating SMMUv3 means
> the host has to chase all the pagetable pointers in guest memory and trap
> writes such that it has visibility of invalid->valid transitions and can
> update the physical shadow pagetable correspondingly.

Sorry, I meant in the context of a future virtio-iommu that provides
nested translation.

e.g. why would anyone want to use virtio to provide SMMUv3-based
HW-accelerated nesting?

Jean suggested that the invalidation flow for virtio-iommu could be
faster because it is in kernel, but I'm saying that we could also make
the SMMUv3 invalidation in-kernel with the same basic technique. (and
actively wondering if we should put more focus on that)

I understand the appeal of the virtio scheme with its current
map/unmap interface.

I could also see some appeal of a simple virtio-iommu SVA that could
directly point at a CPU page table as an option. The guest already has
to know how to manage these anyhow, so it is nicely general.

If iommufd could provide a general cross-driver API to set exactly
that scenario up, then the VMM code could also be general. That seems
pretty interesting.

But if the plan is to expose more detailed stuff like the CD or GCR3
PASID tables as something the guest has to manipulate, plus a bunch
of special invalidations to support that and VMM code to back it, then
I'm questioning the whole point. We lose the generality.

Just use the normal HW accelerated SMMUv3 nesting model instead.

If virtio-iommu SVA is really important for ARM then I'd suggest
SMMUv3 should gain a new HW capability to allow the CD table to be
in hypervisor memory so it works consistently for virtio-iommu SVA.

Jason
  
Robin Murphy March 10, 2023, 3:57 p.m. UTC | #23
On 2023-03-10 15:25, Jason Gunthorpe wrote:
> On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote:
>> On 2023-03-09 21:01, Jason Gunthorpe wrote:
>>>> For a lot of SMMUv3 implementations that have a single queue and for
>>>> other architectures, we can do better than hardware emulation.
>>>
>>> How is using a SW emulated virtio formatted queue better than using a
>>> SW emulated SMMUv3 ECMDQ?
>>
>> Since it's not been said, the really big thing is that virtio explicitly
>> informs the host whenever the guest maps something. Emulating SMMUv3 means
>> the host has to chase all the pagetable pointers in guest memory and trap
>> writes such that it has visibility of invalid->valid transitions and can
>> update the physical shadow pagetable correspondingly.
> 
> Sorry, I mean in the context of future virtio-iommu that is providing
> nested translation.

Ah, that's probably me missing the context again.

> eg why would anyone want to use virtio to provide SMMUv3 based HW
> accelerated nesting?
> 
> Jean suggested that the invalidation flow for virtio-iommu could be
> faster because it is in kernel, but I'm saying that we could also make
> the SMMUv3 invalidation in-kernel with the same basic technique. (and
> actively wondering if we should put more focus on that)
> 
> I understand the appeal of the virtio scheme with its current
> map/unmap interface.
> 
> I could also see some appeal of a simple virtio-iommu SVA that could
> point map a CPU page table as an option. The guest already has to know
> how to manage these anyhow so it is nicely general.
> 
> If iommufd could provide a general cross-driver API to set exactly
> that scenario up then VMM code could also be general. That seems
> pretty interesting.

Indeed, I've always assumed the niche for virtio would be that kind of 
in-between use-case using nesting to accelerate simple translation, 
where we plug a guest-owned pagetable into a host-owned context. That 
way the guest retains the simple virtio interface and only needs to 
understand a pagetable format (or as you say, simply share a CPU 
pagetable) without having to care about the nitty-gritty of all the 
IOMMU-specific moving parts around it. For guests that want to get into 
more advanced stuff like managing their own PASID tables, pushing them 
towards "native" nesting probably does make more sense.

> But if the plan is to expose more detailed stuff like the CD or GCR3
> PASID tables as something the guest has to manipulate and then a bunch
> of special invalidation to support that, and VMM code to back it, then
> I'm questioning the whole point. We lost the generality.
> 
> Just use the normal HW accelerated SMMUv3 nesting model instead.
> 
> If virtio-iommu SVA is really important for ARM then I'd suggest
> SMMUv3 should gain a new HW capability to allow the CD table to be
> in hypervisor memory so it works consistently for virtio-iommu SVA.

Oh, maybe I should have read this far before reasoning the exact same 
thing from scratch... oh well, this time I'm not going to go back and 
edit :)

Thanks,
Robin.
  
Jason Gunthorpe March 10, 2023, 4:03 p.m. UTC | #24
On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:

> about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> guests that want to get into more advanced stuff like managing their own
> PASID tables, pushing them towards "native" nesting probably does make more
> sense.

IMHO with the simplified virtio model I would say the guest should
not have its own PASID table.

Hyper-trap to install a PASID and let the hypervisor driver handle
this abstractly. If abstraction is the whole point and benefit, then
virtio should lean into that.

This also means virtio protocol doesn't do PASID invalidation. It
invalidates an ASID and the hypervisor takes care of whatever it is
connected to. Very simple and general for the VM.

Adding an S1 iommu_domain op to invalidate an address range is perfectly
fine, and the virtio kernel hypervisor driver can call it generically.
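
Something along these lines, purely to show the shape; this is not an
existing op today:

	/* Hypothetical S1 invalidation op the virtio driver could call. */
	int (*cache_invalidate_range)(struct iommu_domain *s1_domain,
				      unsigned long iova, size_t size);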

The primary reason to have guest-owned PASID tables is CC stuff, which
definitely won't be part of virtio-iommu.

Jason
  
Jason Gunthorpe March 10, 2023, 4:06 p.m. UTC | #25
On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> 
> > Nicolin, I think we should tweak the uAPI here so that the
> > invalidation opaque data has a format tagged on its own, instead of
> > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > tag and also a virtio-viommu invalidate type tag.
> 
> The invalidation tag is shared with the hwpt allocation. Does
> it mean that virtio-iommu won't have its own allocation tag?

We probably shouldn't assume it will.

Jason
  
Nicolin Chen March 16, 2023, 12:59 a.m. UTC | #26
On Fri, Mar 10, 2023 at 12:06:18PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> > 
> > > Nicolin, I think we should tweak the uAPI here so that the
> > > invalidation opaque data has a format tagged on its own, instead of
> > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type
> > > tag and also a virtio-viommu invalidate type tag.
> > 
> > The invalidation tag is shared with the hwpt allocation. Does
> > it mean that virtio-iommu won't have its own allocation tag?
> 
> We probably shouldn't assume it will

In that case, why do we need an invalidation tag/type on its
own? Can't we use an IOMMU_HWPT_TYPE_VIRTIO tag for both allocation
and invalidation for virtio?

Or did you mean that we should define a flag inside the data
structure like this?

struct iommu_hwpt_invalidate_arm_smmuv3 {
#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF	(1 << 0)
#define IOMMU_SMMUV3_FORMAT_VIRTIO	(1ULL << 63)
	__u64 flags;
};
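
Or, if you meant a separate tag for the invalidation data format, then
something like this (names made up just to illustrate)?

enum iommu_hwpt_invalidate_data_type {
	IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3,
	IOMMU_HWPT_INVALIDATE_DATA_VIRTIO,
};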

Thanks
Nic
  
Tian, Kevin March 17, 2023, 10:04 a.m. UTC | #27
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Friday, March 10, 2023 11:57 PM
> 
> >
> > If iommufd could provide a general cross-driver API to set exactly
> > that scenario up then VMM code could also be general. That seems
> > prettty interesting.
> 
> Indeed, I've always assumed the niche for virtio would be that kind of
> in-between use-case using nesting to accelerate simple translation,
> where we plug a guest-owned pagetable into a host-owned context. That
> way the guest retains the simple virtio interface and only needs to
> understand a pagetable format (or as you say, simply share a CPU
> pagetable) without having to care about the nitty-gritty of all the
> IOMMU-specific moving parts around it. For guests that want to get into
> more advanced stuff like managing their own PASID tables, pushing them
> towards "native" nesting probably does make more sense.
> 

The interesting thing is that we cannot expose both virtio-iommu and
an emulated vIOMMU to one guest to choose from. So if the guest has
been using virtio-iommu for whatever reason, it may naturally
want more advanced features on it too.
  
Tian, Kevin March 17, 2023, 10:10 a.m. UTC | #28
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, March 11, 2023 12:03 AM
> 
> On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:
> 
> > about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> > guests that want to get into more advanced stuff like managing their own
> > PASID tables, pushing them towards "native" nesting probably does make
> more
> > sense.
> 
> IMHO with the simplified virtio model I would say the guest should
> not have its own PASID table.
> 
> Hyper-trap to install a PASID and let the hypervisor driver handle
> this abstractly. If abstraction is the whole point and benefit, then
> virtio should lean into that.
> 
> This also means virtio protocol doesn't do PASID invalidation. It
> invalidates an ASID and the hypervisor takes care of whatever it is
> connected to. Very simple and general for the VM.

This sounds fair, if ASID here refers to a generic ID identifying the
page table rather than the ARM-specific ASID. 😊

But the guest still needs to manage the PASID and program it into
the assigned device to tag DMA.

> 
> Adding a S1 iommu_domain op for invalidate address range is perfectly
> fine and the virtio kernel hypervisor driver can call it generically.
> 
> The primary reason to have guest-owned PASID tables is CC stuff, which
> definitely won't be part of virtio-iommu.
> 

This fits Intel well.
  
Tian, Kevin March 17, 2023, 10:17 a.m. UTC | #29
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 10, 2023 8:52 PM
> 
> On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote:
> 
> > I do agree with Jean. We spent a lot of efforts all together to define
> > this generic invalidation API and if there is compelling reason that
> > prevents from using it, we should try to reuse it.
> 
> That's the compelling reason in a nutshell right there.
> 
> A lot of time was invested to create something that might be
> general. We still don't know if it is well defined and general. Even
> more time is going to be required on it before it could go forward. In
> future more time will be needed for every future HW to try and fit
> into it. We don't even know if it will scale to future HW. Nobody has
> even checked what today's POWER and S390 HW need.
> 
> vs, this stuff was made in a few days. We know it is correct as a uAPI
> since it mirrors the HW and we know it is scalable to different HW
> schemes if they come up.
> 
> So I don't see a good reason to take a risk on a "general" uAPI. If we
> make this wrong it could seriously damage the main goal of iommufd -
> to build accelerated vIOMMU models.
> 

I agree with this point. We can add a virtio format when it comes.
  

Patch

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 8f9985bddeeb..5e798b2f9a3a 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -173,6 +173,7 @@  iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 static const size_t iommufd_hwpt_alloc_data_size[] = {
 	[IOMMU_HWPT_TYPE_DEFAULT] = 0,
 	[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_intel_vtd),
+	[IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_arm_smmuv3),
 };
 
 /*
@@ -183,6 +184,8 @@  const u64 iommufd_hwpt_type_bitmaps[] =  {
 	[IOMMU_HW_INFO_TYPE_DEFAULT] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT),
 	[IOMMU_HW_INFO_TYPE_INTEL_VTD] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
 					 BIT_ULL(IOMMU_HWPT_TYPE_VTD_S1),
+	[IOMMU_HW_INFO_TYPE_ARM_SMMUV3] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
+					  BIT_ULL(IOMMU_HWPT_TYPE_ARM_SMMUV3),
 };
 
 /* Return true if type is supported, otherwise false */
@@ -329,6 +332,7 @@  int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
  */
 static const size_t iommufd_hwpt_invalidate_info_size[] = {
 	[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_invalidate_intel_vtd),
+	[IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_invalidate_arm_smmuv3),
 };
 
 int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 514db4c26927..0b0097af7c86 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -280,6 +280,7 @@  union ucmd_buffer {
 	 * path.
 	 */
 	struct iommu_hwpt_invalidate_intel_vtd vtd;
+	struct iommu_hwpt_invalidate_arm_smmuv3 smmuv3;
 };
 
 struct iommufd_ioctl_op {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2a6c326391b2..0d5551b1b2be 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -352,10 +352,13 @@  struct iommu_vfio_ioas {
  * enum iommu_hwpt_type - IOMMU HWPT Type
  * @IOMMU_HWPT_TYPE_DEFAULT: default
  * @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_TYPE_ARM_SMMUV3: ARM SMMUv3 stage-1 Context Descriptor
+ *                              table
  */
 enum iommu_hwpt_type {
 	IOMMU_HWPT_TYPE_DEFAULT,
 	IOMMU_HWPT_TYPE_VTD_S1,
+	IOMMU_HWPT_TYPE_ARM_SMMUV3,
 };
 
 /**
@@ -411,6 +414,28 @@  struct iommu_hwpt_intel_vtd {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
+ *
+ * @flags: page table entry attributes
+ * @s2vmid: Virtual machine identifier
+ * @s1ctxptr: Stage-1 context descriptor pointer
+ * @s1cdmax: Number of CDs pointed to by s1ContextPtr
+ * @s1fmt: Stage-1 Format
+ * @s1dss: Default substream
+ */
+struct iommu_hwpt_arm_smmuv3 {
+#define IOMMU_SMMUV3_FLAG_S2	(1 << 0) /* if unset, stage-1 */
+#define IOMMU_SMMUV3_FLAG_VMID	(1 << 1) /* vmid override */
+	__u64 flags;
+	__u32 s2vmid;
+	__u32 __reserved;
+	__u64 s1ctxptr;
+	__u64 s1cdmax;
+	__u64 s1fmt;
+	__u64 s1dss;
+};
+
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
@@ -446,6 +471,8 @@  struct iommu_hwpt_intel_vtd {
  * +------------------------------+-------------------------------------+-----------+
  * | IOMMU_HWPT_TYPE_VTD_S1       |      struct iommu_hwpt_intel_vtd    |    HWPT   |
  * +------------------------------+-------------------------------------+-----------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3   |      struct iommu_hwpt_arm_smmuv3   | IOAS/HWPT |
+ * +------------------------------+-------------------------------------+-----------+
  */
 struct iommu_hwpt_alloc {
 	__u32 size;
@@ -463,10 +490,12 @@  struct iommu_hwpt_alloc {
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_DEFAULT,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD,
+	IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
 };
 
 /**
@@ -591,6 +620,25 @@  struct iommu_hwpt_invalidate_intel_vtd {
 	__u64 nb_granules;
 };
 
+/**
+ * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
+ * @flags: boolean attributes of cache invalidation command
+ * @opcode: opcode of cache invalidation command
+ * @ssid: SubStream ID
+ * @granule_size: page/block size of the mapping in bytes
+ * @range: IOVA range to invalidate
+ */
+struct iommu_hwpt_invalidate_arm_smmuv3 {
+#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF	(1 << 0)
+	__u64 flags;
+	__u8 opcode;
+	__u8 padding[3];
+	__u32 asid;
+	__u32 ssid;
+	__u32 granule_size;
+	struct iommu_iova_range range;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
@@ -609,6 +657,8 @@  struct iommu_hwpt_invalidate_intel_vtd {
  * +------------------------------+----------------------------------------+
  * | IOMMU_HWPT_TYPE_VTD_S1       | struct iommu_hwpt_invalidate_intel_vtd |
  * +------------------------------+----------------------------------------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3   | struct iommu_hwpt_invalidate_arm_smmuv3|
+ * +------------------------------+----------------------------------------+
  */
 struct iommu_hwpt_invalidate {
 	__u32 size;