[v7,1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

Message ID 20231117131816.24359-2-yi.l.liu@intel.com
State New
Headers
Series Add Intel VT-d nested translation (part 2/2) |

Commit Message

Yi Liu Nov. 17, 2023, 1:18 p.m. UTC
  This adds the data structure for flushing iotlb for the nested domain
allocated with IOMMU_HWPT_DATA_VTD_S1 type.

This only supports invalidating IOTLB, but no for device-TLB as device-TLB
invalidation will be covered automatically in the IOTLB invalidation if the
underlying IOMMU driver has enabled ATS for the affected device.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
  

Comments

Tian, Kevin Nov. 20, 2023, 8:26 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, November 17, 2023 9:18 PM
> 
> This adds the data structure for flushing iotlb for the nested domain
> allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> 
> This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> invalidation will be covered automatically in the IOTLB invalidation if the
> underlying IOMMU driver has enabled ATS for the affected device.

"no for device-TLB" is misleading. Here just say that cache invalidation
request applies to both IOTLB and device TLB (if ATS is enabled ...)

> 
> +/**
> + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
> + *                                           stage-1 cache invalidation
> + * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only
> the
> + *                            leaf PTE caching needs to be invalidated
> + *                            and other paging structure caches can be
> + *                            preserved.

remove the words after 'and'

> +
> +/**
> + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
> + *                                       (IOMMU_HWPT_DATA_VTD_S1)
> + * @addr: The start address of the addresses to be invalidated. It needs
> + *        to be 4KB aligned.

remove "of the addresses"

> + * @npages: Number of contiguous 4K pages to be invalidated.
> + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
> + * @__reserved: Must be 0
> + *
> + * The Intel VT-d specific invalidation data for user-managed stage-1 cache
> + * invalidation in nested translation. Userspace uses this structure to
> + * tell the impacted cache scope after modifying the stage-1 page table.
> + *
> + * Invalidating all the caches related to the page table by setting @addr
> + * to be 0 and @npages to be __aligned_u64(-1). This includes the
> + * corresponding device-TLB if ATS is enabled on the attached devices.

put words about device-TLB to last paragraph. Putting it here is confusing
as if it only applies to invalidate-all.
  
Jason Gunthorpe Nov. 20, 2023, 11:04 p.m. UTC | #2
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, November 17, 2023 9:18 PM
> > 
> > This adds the data structure for flushing iotlb for the nested domain
> > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > 
> > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> > invalidation will be covered automatically in the IOTLB invalidation if the
> > underlying IOMMU driver has enabled ATS for the affected device.
> 
> "no for device-TLB" is misleading. Here just say that cache invalidation
> request applies to both IOTLB and device TLB (if ATS is enabled ...)

I think we should forward the ATS invalidation from the guest too?
That is what ARM and AMD will have to do, can we keep them all
consistent?

I understand Intel keeps track of enough stuff to know what the RIDs
are, but is it necessary to make it different?

Jason
  
Tian, Kevin Nov. 21, 2023, 2:54 a.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 21, 2023 7:05 AM
> 
> On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, November 17, 2023 9:18 PM
> > >
> > > This adds the data structure for flushing iotlb for the nested domain
> > > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > >
> > > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> > > invalidation will be covered automatically in the IOTLB invalidation if the
> > > underlying IOMMU driver has enabled ATS for the affected device.
> >
> > "no for device-TLB" is misleading. Here just say that cache invalidation
> > request applies to both IOTLB and device TLB (if ATS is enabled ...)
> 
> I think we should forward the ATS invalidation from the guest too?
> That is what ARM and AMD will have to do, can we keep them all
> consistent?
> 
> I understand Intel keeps track of enough stuff to know what the RIDs
> are, but is it necessary to make it different?
> 

probably ask the other way. Now intel-iommu driver always flushes
iotlb and device tlb together then is it necessary to separate them
in uAPI for no good (except doubled syscalls)? :)

anyway this is driver specific contract. I don't see a need to keep
it consistent for all.
  
Jason Gunthorpe Nov. 21, 2023, 12:17 p.m. UTC | #4
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 21, 2023 7:05 AM
> > 
> > On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Friday, November 17, 2023 9:18 PM
> > > >
> > > > This adds the data structure for flushing iotlb for the nested domain
> > > > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > > >
> > > > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> > > > invalidation will be covered automatically in the IOTLB invalidation if the
> > > > underlying IOMMU driver has enabled ATS for the affected device.
> > >
> > > "no for device-TLB" is misleading. Here just say that cache invalidation
> > > request applies to both IOTLB and device TLB (if ATS is enabled ...)
> > 
> > I think we should forward the ATS invalidation from the guest too?
> > That is what ARM and AMD will have to do, can we keep them all
> > consistent?
> > 
> > I understand Intel keeps track of enough stuff to know what the RIDs
> > are, but is it necessary to make it different?
> 
> probably ask the other way. Now intel-iommu driver always flushes
> iotlb and device tlb together then is it necessary to separate them
> in uAPI for no good (except doubled syscalls)? :)

I wish I knew more about Intel CC design to be able to answer that :|

Doesn't the VM issue the ATC flush command regardless? How does it
know it has a working ATC but does not need to flush it?

> anyway this is driver specific contract. I don't see a need to keep
> it consistent for all.

Given that ARM and AMD need this and would have serious bugs if it
didn't work this way I'm mildly concerned that Intel will be missing
something here..

To my mind it seems like this is just a hold over from the prior
design.

Jason
  
Baolu Lu Nov. 22, 2023, 2:32 a.m. UTC | #5
On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
> On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Tuesday, November 21, 2023 7:05 AM
>>>
>>> On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Friday, November 17, 2023 9:18 PM
>>>>>
>>>>> This adds the data structure for flushing iotlb for the nested domain
>>>>> allocated with IOMMU_HWPT_DATA_VTD_S1 type.
>>>>>
>>>>> This only supports invalidating IOTLB, but no for device-TLB as device-TLB
>>>>> invalidation will be covered automatically in the IOTLB invalidation if the
>>>>> underlying IOMMU driver has enabled ATS for the affected device.
>>>>
>>>> "no for device-TLB" is misleading. Here just say that cache invalidation
>>>> request applies to both IOTLB and device TLB (if ATS is enabled ...)
>>>
>>> I think we should forward the ATS invalidation from the guest too?
>>> That is what ARM and AMD will have to do, can we keep them all
>>> consistent?
>>>
>>> I understand Intel keeps track of enough stuff to know what the RIDs
>>> are, but is it necessary to make it different?
>>
>> probably ask the other way. Now intel-iommu driver always flushes
>> iotlb and device tlb together then is it necessary to separate them
>> in uAPI for no good (except doubled syscalls)? :)
> 
> I wish I knew more about Intel CC design to be able to answer that :|
> 
> Doesn't the VM issue the ATC flush command regardless? How does it
> know it has a working ATC but does not need to flush it?
> 

The Intel VT-d spec doesn't require the driver to flush iotlb and device
tlb together. Therefore, the current approach of relying on caching mode
to determine whether device TLB invalidation is necessary appears to be
a performance optimization rather than an architectural requirement.

The vIOMMU driver assumes that it is running within a VM guest when
caching mode is enabled. This assumption leads to an omission of device
TLB invalidation, relying on the hypervisor to perform a combined flush
of the IOLB and device TLB.

While this optimization aims to reduce VMEXIT overhead, it introduces
potential issues:

- When a Linux guest running on a hypervisor other than KVM/QEMU, the
   assumption of combined IOLB and device TLB flushing by the hypervisor
   may be incorrect, potentially leading to missed device TLB
   invalidation.

- The caching mode doesn't apply to first-stage translation. Therefore,
   if the driver uses first-stage translation and still relies on caching
   mode to determine device TLB invalidation, the optimization fails.

A more reasonable optimization would be to allocate a bit in the iommu
capability registers. The vIOMMU driver could then leverage this bit to
determine whether it could eliminate a device invalidation request.

Best regards,
baolu
  
Yi Liu Nov. 22, 2023, 3:52 a.m. UTC | #6
On 2023/11/22 10:32, Baolu Lu wrote:
> On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
>> On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, November 21, 2023 7:05 AM
>>>>
>>>> On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Sent: Friday, November 17, 2023 9:18 PM
>>>>>>
>>>>>> This adds the data structure for flushing iotlb for the nested domain
>>>>>> allocated with IOMMU_HWPT_DATA_VTD_S1 type.
>>>>>>
>>>>>> This only supports invalidating IOTLB, but no for device-TLB as 
>>>>>> device-TLB
>>>>>> invalidation will be covered automatically in the IOTLB invalidation 
>>>>>> if the
>>>>>> underlying IOMMU driver has enabled ATS for the affected device.
>>>>>
>>>>> "no for device-TLB" is misleading. Here just say that cache invalidation
>>>>> request applies to both IOTLB and device TLB (if ATS is enabled ...)
>>>>
>>>> I think we should forward the ATS invalidation from the guest too?
>>>> That is what ARM and AMD will have to do, can we keep them all
>>>> consistent?
>>>>
>>>> I understand Intel keeps track of enough stuff to know what the RIDs
>>>> are, but is it necessary to make it different?
>>>
>>> probably ask the other way. Now intel-iommu driver always flushes
>>> iotlb and device tlb together then is it necessary to separate them
>>> in uAPI for no good (except doubled syscalls)? :)
>>
>> I wish I knew more about Intel CC design to be able to answer that :|
>>
>> Doesn't the VM issue the ATC flush command regardless? How does it
>> know it has a working ATC but does not need to flush it?
>>
> 
> The Intel VT-d spec doesn't require the driver to flush iotlb and device
> tlb together.

Spec has below description. Although it does not say iotlb and device tlb
should be flushed together, but there is indeed requirement that both 
should be flushed when a page is unmapped.

Chapter 6.5.2.5:
"Since translation requests-without-PASID from a device may be serviced by 
hardware from the
IOTLB, software must always request IOTLB invalidation (iotlb_inv_dsc) 
before requesting
corresponding Device-TLB (dev_tlb_inv_dsc) invalidation."

> Therefore, the current approach of relying on caching mode
> to determine whether device TLB invalidation is necessary appears to be
> a performance optimization rather than an architectural requirement.
> 
> The vIOMMU driver assumes that it is running within a VM guest when
> caching mode is enabled. This assumption leads to an omission of device
> TLB invalidation, relying on the hypervisor to perform a combined flush
> of the IOLB and device TLB.

yes, this is what the current intel iommu driver does. However, whether
rely on caching mode or not is orthogonal with whether we need to uapis
here. I think guest iommu driver could submit both iotlb and device tlb
invalidation request. But Qemu could select if it needs to forward the
device tlb invalidation request to kernel if kernel iommu driver has
already covered the device tlb invalidation when get the request to
invalidate iotlb.

> While this optimization aims to reduce VMEXIT overhead, it introduces
> potential issues:
> 
> - When a Linux guest running on a hypervisor other than KVM/QEMU, the
>    assumption of combined IOLB and device TLB flushing by the hypervisor
>    may be incorrect, potentially leading to missed device TLB
>    invalidation.

Hmmm, this appears to be an intel iommu driver bug, it should submit both
iotlb invalidation and device tlb invalidation requests. But as above, I
think this is orthogonal here. KVM/QEMU could define its own uapi based on
the implementation to gain the best suit.

> 
> - The caching mode doesn't apply to first-stage translation. Therefore,
>    if the driver uses first-stage translation and still relies on caching
>    mode to determine device TLB invalidation, the optimization fails.

yes, caching mode does no apply to first-stage translation table. But in
nested translation, guest does not need to notify hypervisor when there is
page unmapped. is it? So whether caching mode applies to first-stage
translation table does not matter. TBH. I didn't see the problem due to
this reason. But I agree that linux guest intel iommu driver needs to
submit both iotlb and device tlb invalidation request to guarantee it can
work on other hypervisors. And there should be other way to do the
performance optimization.

> 
> A more reasonable optimization would be to allocate a bit in the iommu
> capability registers. The vIOMMU driver could then leverage this bit to
> determine whether it could eliminate a device invalidation request.

this may be something spec can be enhanced. But again it is just to make
guest intel iommu driver to gain performance optimization and also can
work on other hypervisors. As of this uapi design, considering it within
the linux ecosystem is enough.
  
Tian, Kevin Nov. 22, 2023, 4:58 a.m. UTC | #7
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 21, 2023 8:17 PM
> 
> On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, November 21, 2023 7:05 AM
> > >
> > > On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Friday, November 17, 2023 9:18 PM
> > > > >
> > > > > This adds the data structure for flushing iotlb for the nested domain
> > > > > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > > > >
> > > > > This only supports invalidating IOTLB, but no for device-TLB as device-
> TLB
> > > > > invalidation will be covered automatically in the IOTLB invalidation if
> the
> > > > > underlying IOMMU driver has enabled ATS for the affected device.
> > > >
> > > > "no for device-TLB" is misleading. Here just say that cache invalidation
> > > > request applies to both IOTLB and device TLB (if ATS is enabled ...)
> > >
> > > I think we should forward the ATS invalidation from the guest too?
> > > That is what ARM and AMD will have to do, can we keep them all
> > > consistent?
> > >
> > > I understand Intel keeps track of enough stuff to know what the RIDs
> > > are, but is it necessary to make it different?
> >
> > probably ask the other way. Now intel-iommu driver always flushes
> > iotlb and device tlb together then is it necessary to separate them
> > in uAPI for no good (except doubled syscalls)? :)
> 
> I wish I knew more about Intel CC design to be able to answer that :|
> 
> Doesn't the VM issue the ATC flush command regardless? How does it
> know it has a working ATC but does not need to flush it?
> 
> > anyway this is driver specific contract. I don't see a need to keep
> > it consistent for all.
> 
> Given that ARM and AMD need this and would have serious bugs if it
> didn't work this way I'm mildly concerned that Intel will be missing
> something here..
> 
> To my mind it seems like this is just a hold over from the prior
> design.
> 

As Yi/Baolu discussed there is an issue in intel-iommu driver which
incorrectly skips devtlb invalidation in the guest with the assumption
that the host combines iotlb/devtlb invalidation together. This is
incorrect and should be fixed.

But what I was talking about earlier is about the uAPI between
viommu and iommu driver. I don't see a need of having separate
invalidation cmds for each, as I'm not sure what the user can
expect in the window when iotlb and devtlb are out of sync.

then we just define hwpt 'cache' invalidation in vtd always refers to
both iotlb and devtlb. Then viommu just needs to call invalidation
uapi once when emulating virtual iotlb invalidation descriptor
while emulating the following devtlb invalidation descriptor
as a nop.
  
Jason Gunthorpe Nov. 22, 2023, 1:25 p.m. UTC | #8
On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:

> As Yi/Baolu discussed there is an issue in intel-iommu driver which
> incorrectly skips devtlb invalidation in the guest with the assumption
> that the host combines iotlb/devtlb invalidation together. This is
> incorrect and should be fixed.

Yes, this seems quite problematic - you guys will have to think of
something and decide what kind of backward compat you want :(

Maybe the viommu driver can observe the guest and if it sees an ATC
invalidation assume it is non-buggy, until seen it can do a combined
flush.

> But what I was talking about earlier is about the uAPI between
> viommu and iommu driver. I don't see a need of having separate
> invalidation cmds for each, as I'm not sure what the user can
> expect in the window when iotlb and devtlb are out of sync.

If the guest is always issuing the device invalidation then I don't
see too much point in suppressing it in the kernel. Just forward it
naturally.

> then we just define hwpt 'cache' invalidation in vtd always refers to
> both iotlb and devtlb. Then viommu just needs to call invalidation
> uapi once when emulating virtual iotlb invalidation descriptor
> while emulating the following devtlb invalidation descriptor
> as a nop.

In principle ATC and IOMMU TLB invalidations should not always be
linked.

Any scenario that allows devices to share an IOTLB cache tag requires
fewer IOMMU TLB invalidations than ATC invalidations.

I like the view of this invalidation interface as reflecting the
actual HW and not trying to be smarter an real HW.

I'm fully expecting that Intel will adopt an direct-DMA flush queue
like SMMU and AMD have already done as a performance optimization. In
this world it makes no sense that the behavior of the direct DMA queue
and driver mediated queue would be different.

Jason
  
Tian, Kevin Nov. 24, 2023, 3 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 22, 2023 9:26 PM
> 
> On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:
> > then we just define hwpt 'cache' invalidation in vtd always refers to
> > both iotlb and devtlb. Then viommu just needs to call invalidation
> > uapi once when emulating virtual iotlb invalidation descriptor
> > while emulating the following devtlb invalidation descriptor
> > as a nop.
> 
> In principle ATC and IOMMU TLB invalidations should not always be
> linked.
> 
> Any scenario that allows devices to share an IOTLB cache tag requires
> fewer IOMMU TLB invalidations than ATC invalidations.

as long as the host iommu driver has the same knowledge then it will
always do the right thing.

e.g. one iotlb entry shared by 4 devices.

guest issues:
	1) iotlb invalidation
	2) devtlb invalidation for dev1
	3) devtlb invalidation for dev2
	4) devtlb invalidation for dev3
	5) devtlb invalidation for dev4

intel-viommu calls HWPT cache invalidation for 1) and treats 2-5) as nop.

intel-iommu driver internally knows the iotlb is shared by 4 devices (given
the same domain is attached to those devices) to handle HWPT
cache invalidation:

	1) iotlb invalidation
	2) devtlb invalidation for dev1
	3) devtlb invalidation for dev2
	4) devtlb invalidation for dev3
	5) devtlb invalidation for dev4

this is a good optimization by reducing 5 syscalls to 1, with the 
assumption that the guest shouldn't expect any deterministic
behavior before 5) is completed to bring iotlb/devtlbs in sync.

another alternative is to have guest batch 1-5) in one request which
allows viommu to batch them in one invalidation call too. But
this is an orthogonal optimization in guest which we don't want
to rely on.

> 
> I like the view of this invalidation interface as reflecting the
> actual HW and not trying to be smarter an real HW.

the guest-oriented interface e.g. viommu reflects the HW.

uAPI is kind of viommu internal implementation. IMHO it's not
a bad thing to make it smarter as long as no guest observable
breakage.

> 
> I'm fully expecting that Intel will adopt an direct-DMA flush queue
> like SMMU and AMD have already done as a performance optimization. In
> this world it makes no sense that the behavior of the direct DMA queue
> and driver mediated queue would be different.
> 

that's a orthogonal topic. I don't think the value of direct-DMA flush
queue should prevent possible optimization in the mediation path
(as long as guest-expected deterministic behavior is sustained).
  
Jason Gunthorpe Nov. 24, 2023, 1:46 p.m. UTC | #10
On Fri, Nov 24, 2023 at 03:00:45AM +0000, Tian, Kevin wrote:

> > I'm fully expecting that Intel will adopt an direct-DMA flush queue
> > like SMMU and AMD have already done as a performance optimization. In
> > this world it makes no sense that the behavior of the direct DMA queue
> > and driver mediated queue would be different.
> > 
> 
> that's a orthogonal topic. I don't think the value of direct-DMA flush
> queue should prevent possible optimization in the mediation path
> (as long as guest-expected deterministic behavior is sustained).

Okay, well as long as the guest is generating the ATC invalidations we
can always make an iommufd API flag to include or exclude the ATC
invalidation when doing the ASID invalidation. So we aren't trapped

Jason
  
Yi Liu Dec. 14, 2023, 11:26 a.m. UTC | #11
On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for 
flushing iotlb for the nested domain
 > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
 >
 > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
 > invalidation will be covered automatically in the IOTLB invalidation if the
 > underlying IOMMU driver has enabled ATS for the affected device.
 >
 > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
 > ---
 >   include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++
 >   1 file changed, 36 insertions(+)
 >
 > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
 > index 7f92cecc87d7..cafd98642abf 100644
 > --- a/include/uapi/linux/iommufd.h
 > +++ b/include/uapi/linux/iommufd.h
 > @@ -614,6 +614,42 @@ struct iommu_hwpt_get_dirty_bitmap {
 >   #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
 >   					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
 >
 > +/**
 > + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
 > + *                                           stage-1 cache invalidation
 > + * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
 > + *                            leaf PTE caching needs to be invalidated
 > + *                            and other paging structure caches can be
 > + *                            preserved.
 > + */
 > +enum iommu_hwpt_vtd_s1_invalidate_flags {
 > +	IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
 > +};
 > +
 > +/**
 > + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
 > + *                                       (IOMMU_HWPT_DATA_VTD_S1)
 > + * @addr: The start address of the addresses to be invalidated. It needs
 > + *        to be 4KB aligned.
 > + * @npages: Number of contiguous 4K pages to be invalidated.
 > + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
 > + * @__reserved: Must be 0
 > + *
 > + * The Intel VT-d specific invalidation data for user-managed stage-1 cache
 > + * invalidation in nested translation. Userspace uses this structure to
 > + * tell the impacted cache scope after modifying the stage-1 page table.
 > + *
 > + * Invalidating all the caches related to the page table by setting @addr
 > + * to be 0 and @npages to be __aligned_u64(-1). This includes the
 > + * corresponding device-TLB if ATS is enabled on the attached devices.
 > + */
 > +struct iommu_hwpt_vtd_s1_invalidate {
 > +	__aligned_u64 addr;
 > +	__aligned_u64 npages;
 > +	__u32 flags;
 > +	__u32 __reserved;
 > +};
 > +
 >   /**
 >    * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
 >    * @size: sizeof(struct iommu_hwpt_invalidate)

Hi Jason, Kevin,

Per the prior discussion[1], we agreed to move the error reporting into the
driver specific part. On Intel side, we want to report two devTLB
invalidation errors: ICE (invalid completion error) and ITE (invalidation
timeout error). Such errors have an additional SID information to tell
which device failed the devTLB invalidation. I've got the below structure.

+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *                                       (IOMMU_HWPT_DATA_VTD_S1)
+ * @addr: The start address of the addresses to be invalidated. It needs
+ *        to be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0.
+ * @error: One of enum iommu_hwpt_vtd_s1_invalidate_error_code
+ * @dev_id: The device in the invalidation completion message, it's meaninfful
+ *          when @error_code is set to IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ICE
+ *          or IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ITE.
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be U64_MAX.
+ *
+ * @error_code is meaningful only if the request is handled by kernel. This
+ * can be known by checking struct iommu_hwpt_invalidate::req_num output.
+ * @error_code only covers the errors detected by hardware after submitting
+ * the invalidation. The software detected errors would go through the normal
+ * ioctl errno.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+	__aligned_u64 addr;
+	__aligned_u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+	__u32 error;
+	__u32 dev_id;
+};

dev_id is used to report the failed device, userspace should be able to map
it to a vRID, and inject it to VM as part of ITE/ICE error.

However, I got a problem when trying to get dev_id in cache invalidation
path, since this is filled in intel iommu driver. Seems like there is no
good way for it. I've below alternatives to move forward, wish you have
a look.

- Reporting pSID instead of dev_id. This may not work if userspace for
example Qemu cen get a vfio device cdev fd from management stack. Maybe you
have different opinion, do let me know.

- Let iommufd to convert a SID info or device pointer to a dev_id, and then
report it back to userspace. This seems easiest, but breaks layer and also
requires vt-d specific logic. :(

- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel
iommu driver does not have viommu context, no place to hold the vRID->pRID
mapping. TBH. It may require other reasons to introduce it other than the
error reporting need. Anyhow, this requires more thinking and also has
dependency even if it is doable in intel side.

- Only report error code, but no device info at first. May adding the
device info (dev_id or vRID) in future series. In reality, the existing
Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d
spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info
later in 2019 spec 3.1. We may do it in stage as well.

What about your opinion?

[1] 
https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-3ec52e5bc81a@intel.com/
  
Tian, Kevin Dec. 15, 2023, 1:50 a.m. UTC | #12
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 14, 2023 7:27 PM
> 
> On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for
> flushing iotlb for the nested domain
> 
> +struct iommu_hwpt_vtd_s1_invalidate {
> +	__aligned_u64 addr;
> +	__aligned_u64 npages;
> +	__u32 flags;
> +	__u32 __reserved;
> +	__u32 error;
> +	__u32 dev_id;
> +};
> 
> dev_id is used to report the failed device, userspace should be able to map
> it to a vRID, and inject it to VM as part of ITE/ICE error.
> 
> However, I got a problem when trying to get dev_id in cache invalidation
> path, since this is filled in intel iommu driver. Seems like there is no
> good way for it. I've below alternatives to move forward, wish you have
> a look.
> 
> - Reporting pSID instead of dev_id. This may not work if userspace for
> example Qemu cen get a vfio device cdev fd from management stack. Maybe
> you
> have different opinion, do let me know.

yes, there is no guarantee that pRID is always visible to the user.

> 
> - Let iommufd to convert a SID info or device pointer to a dev_id, and then
> report it back to userspace. This seems easiest, but breaks layer and also
> requires vt-d specific logic. :(

yes, the current philosophy of iommufd is to put diver specific knowledge
out of iommufd.

> 
> - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> maintained, then intel iommu can report a vRID back to user. But intel
> iommu driver does not have viommu context, no place to hold the vRID-
> >pRID
> mapping. TBH. It may require other reasons to introduce it other than the
> error reporting need. Anyhow, this requires more thinking and also has
> dependency even if it is doable in intel side.

this sounds like a cleaner way to inject knowledge which iommu driver
requires to find out the user tag. but yes it's a bit weird to introduce
viommu awareness in intel iommu driver when there is no such thing
in real hardware.

and for this error reporting case what we actually require is the
reverse map i.e. pRID->vRID. Not sure whether we can leverage the
same RID mapping uAPI as for ARM/AMD but ignore viommu_id
and then store vRID under device_domain_info. a bit tricky on
life cycle management and also incompatible with SIOV...

let's see whether Jason has a better idea here.

> 
> - Only report error code, but no device info at first. May adding the
> device info (dev_id or vRID) in future series. In reality, the existing
> Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d

and IOAS_UNMAP doesn't provide such ATS error either.

> spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info
> later in 2019 spec 3.1. We may do it in stage as well.

and it's not tied to a specific iommu version. the spec is stated in 
a way that software treats zero value in SID as no hw support so
theoretically even a modern hw may not report SID for certain
reason.

> 
> What about your opinion?
> 
> [1]
> https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-
> 3ec52e5bc81a@intel.com/

I'm fine with this staged approach given the spec allows this
behavior and no vIOMMU properly emulates ITE/ICE today. 

let's work out a new version w/o dev info and make sure it's
in a good state first. Then let Jason decide next week whether
he wants to take it for this cycle or not.
  
Nicolin Chen Dec. 15, 2023, 2:28 a.m. UTC | #13
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, December 14, 2023 7:27 PM
> >
> > On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for
> > flushing iotlb for the nested domain
> >
> > +struct iommu_hwpt_vtd_s1_invalidate {
> > +     __aligned_u64 addr;
> > +     __aligned_u64 npages;
> > +     __u32 flags;
> > +     __u32 __reserved;
> > +     __u32 error;
> > +     __u32 dev_id;
> > +};
> >
> > dev_id is used to report the failed device, userspace should be able to map
> > it to a vRID, and inject it to VM as part of ITE/ICE error.
> >
> > However, I got a problem when trying to get dev_id in cache invalidation
> > path, since this is filled in intel iommu driver. Seems like there is no
> > good way for it. I've below alternatives to move forward, wish you have
> > a look.

> >
> > - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> > maintained, then intel iommu can report a vRID back to user. But intel
> > iommu driver does not have viommu context, no place to hold the vRID-
> > >pRID
> > mapping. TBH. It may require other reasons to introduce it other than the
> > error reporting need. Anyhow, this requires more thinking and also has
> > dependency even if it is doable in intel side.
> 
> this sounds like a cleaner way to inject knowledge which iommu driver
> requires to find out the user tag. but yes it's a bit weird to introduce
> viommu awareness in intel iommu driver when there is no such thing
> in real hardware.

I think a viommu is defined more like a software object representing
the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case,
there could be an object for it too?

> and for this error reporting case what we actually require is the
> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> and then store vRID under device_domain_info. a bit tricky on
> life cycle management and also incompatible with SIOV...

One thing that I am not very clear here: since both vRID and dev_id
are given by the VMM, shouldn't it already know the mapping if the
point is to translate (pRID->)dev_id->vRID?

Thanks
Nicolin
  
Tian, Kevin Dec. 15, 2023, 3:04 a.m. UTC | #14
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, December 15, 2023 10:28 AM
> 
> On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Thursday, December 14, 2023 7:27 PM
> > >
> > > On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for
> > > flushing iotlb for the nested domain
> > >
> > > +struct iommu_hwpt_vtd_s1_invalidate {
> > > +     __aligned_u64 addr;
> > > +     __aligned_u64 npages;
> > > +     __u32 flags;
> > > +     __u32 __reserved;
> > > +     __u32 error;
> > > +     __u32 dev_id;
> > > +};
> > >
> > > dev_id is used to report the failed device, userspace should be able to
> map
> > > it to a vRID, and inject it to VM as part of ITE/ICE error.
> > >
> > > However, I got a problem when trying to get dev_id in cache invalidation
> > > path, since this is filled in intel iommu driver. Seems like there is no
> > > good way for it. I've below alternatives to move forward, wish you have
> > > a look.
> 
> > >
> > > - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> > > maintained, then intel iommu can report a vRID back to user. But intel
> > > iommu driver does not have viommu context, no place to hold the vRID-
> > > >pRID
> > > mapping. TBH. It may require other reasons to introduce it other than the
> > > error reporting need. Anyhow, this requires more thinking and also has
> > > dependency even if it is doable in intel side.
> >
> > this sounds like a cleaner way to inject knowledge which iommu driver
> > requires to find out the user tag. but yes it's a bit weird to introduce
> > viommu awareness in intel iommu driver when there is no such thing
> > in real hardware.
> 
> I think a viommu is defined more like a software object representing
> the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case,
> there could be an object for it too?

for VT-d it's not necessary to maintain such vIOMMU awareness in
the kernel (before this error reporting case) given its interfaces are
simply around hwpt's. there is no vIOMMU-scope operation provided
by intel-iommu driver so far.

> 
> > and for this error reporting case what we actually require is the
> > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > and then store vRID under device_domain_info. a bit tricky on
> > life cycle management and also incompatible with SIOV...
> 
> One thing that I am not very clear here: since both vRID and dev_id
> are given by the VMM, shouldn't it already know the mapping if the
> point is to translate (pRID->)dev_id->vRID?
> 

it's true for current Qemu.

but there is plan to support Qemu accepting a fd passed by Libvirt.
In that case Qemu even doesn't see the sysfs path hence is not
aware of pRID. otherwise yes we could leave the translation to
VMM instead.
  
Nicolin Chen Dec. 15, 2023, 3:32 a.m. UTC | #15
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, December 15, 2023 10:28 AM
> > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > >
> > > > On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for
> > > > flushing iotlb for the nested domain
> > > >
> > > > +struct iommu_hwpt_vtd_s1_invalidate {
> > > > +     __aligned_u64 addr;
> > > > +     __aligned_u64 npages;
> > > > +     __u32 flags;
> > > > +     __u32 __reserved;
> > > > +     __u32 error;
> > > > +     __u32 dev_id;
> > > > +};
> > > >
> > > > dev_id is used to report the failed device, userspace should be able to
> > map
> > > > it to a vRID, and inject it to VM as part of ITE/ICE error.

> > > and for this error reporting case what we actually require is the
> > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > and then store vRID under device_domain_info. a bit tricky on
> > > life cycle management and also incompatible with SIOV...
> >
> > One thing that I am not very clear here: since both vRID and dev_id
> > are given by the VMM, shouldn't it already know the mapping if the
> > point is to translate (pRID->)dev_id->vRID?
> >
> 
> it's true for current Qemu.
> 
> but there is plan to support Qemu accepting a fd passed by Libvirt.
> In that case Qemu even doesn't see the sysfs path hence is not
> aware of pRID. otherwise yes we could leave the translation to
> VMM instead.

I think I misread Yi's narrative: dev_id is a working approach
for VMM to convert to a vRID, while he is asking for a better
alternative :)
  
Yi Liu Dec. 15, 2023, 4:01 a.m. UTC | #16
On 2023/12/15 11:32, Nicolin Chen wrote:
> On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Sent: Friday, December 15, 2023 10:28 AM
>>> On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Thursday, December 14, 2023 7:27 PM
>>>>>
>>>>> On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for
>>>>> flushing iotlb for the nested domain
>>>>>
>>>>> +struct iommu_hwpt_vtd_s1_invalidate {
>>>>> +     __aligned_u64 addr;
>>>>> +     __aligned_u64 npages;
>>>>> +     __u32 flags;
>>>>> +     __u32 __reserved;
>>>>> +     __u32 error;
>>>>> +     __u32 dev_id;
>>>>> +};
>>>>>
>>>>> dev_id is used to report the failed device, userspace should be able to
>>> map
>>>>> it to a vRID, and inject it to VM as part of ITE/ICE error.
> 
>>>> and for this error reporting case what we actually require is the
>>>> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
>>>> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
>>>> and then store vRID under device_domain_info. a bit tricky on
>>>> life cycle management and also incompatible with SIOV...
>>>
>>> One thing that I am not very clear here: since both vRID and dev_id
>>> are given by the VMM, shouldn't it already know the mapping if the
>>> point is to translate (pRID->)dev_id->vRID?
>>>
>>
>> it's true for current Qemu.
>>
>> but there is plan to support Qemu accepting a fd passed by Libvirt.
>> In that case Qemu even doesn't see the sysfs path hence is not
>> aware of pRID. otherwise yes we could leave the translation to
>> VMM instead.
> 
> I think I misread Yi's narrative: dev_id is a working approach
> for VMM to convert to a vRID, while he is asking for a better
> alternative :)

In concept, dev_id works, but in reality we have problem to get a dev_id
for a given device in intel iommu driver, hence I'm asking for help here. :)
  
Nicolin Chen Dec. 16, 2023, 6:49 p.m. UTC | #17
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> On 2023/12/15 11:32, Nicolin Chen wrote:
> > On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Friday, December 15, 2023 10:28 AM
> > > > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > > > > 
> > > > > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > > and for this error reporting case what we actually require is the
> > > > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > > > and then store vRID under device_domain_info. a bit tricky on
> > > > > life cycle management and also incompatible with SIOV...
> > > > 
> > > > One thing that I am not very clear here: since both vRID and dev_id
> > > > are given by the VMM, shouldn't it already know the mapping if the
> > > > point is to translate (pRID->)dev_id->vRID?
> > > > 
> > > 
> > > it's true for current Qemu.
> > > 
> > > but there is plan to support Qemu accepting a fd passed by Libvirt.
> > > In that case Qemu even doesn't see the sysfs path hence is not
> > > aware of pRID. otherwise yes we could leave the translation to
> > > VMM instead.
> > 
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
> 
> In concept, dev_id works, but in reality we have problem to get a dev_id
> for a given device in intel iommu driver, hence I'm asking for help here. :)

Yea, I got that.

Would it be possible for us to postpone this error report in the
vtd driver?

Jason is taking vacation, so he'll unlikely be very active until
the new year, although he would probably spare some time taking
the cache_invalidate series once it's mature.

If the final solution is to use pRID<->vRID mappings for vtd too,
we'd likely need the viommu/dev_set_virtual_id series that I am
still working on, which certainly won't make it to this cycle.

Thanks
Nic
  
Tian, Kevin Dec. 17, 2023, 11:28 p.m. UTC | #18
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, December 17, 2023 2:49 AM
> 
> On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > On 2023/12/15 11:32, Nicolin Chen wrote:
> > > On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Friday, December 15, 2023 10:28 AM
> > > > > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > > > > >
> > > > > > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > > > and for this error reporting case what we actually require is the
> > > > > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > > > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > > > > and then store vRID under device_domain_info. a bit tricky on
> > > > > > life cycle management and also incompatible with SIOV...
> > > > >
> > > > > One thing that I am not very clear here: since both vRID and dev_id
> > > > > are given by the VMM, shouldn't it already know the mapping if the
> > > > > point is to translate (pRID->)dev_id->vRID?
> > > > >
> > > >
> > > > it's true for current Qemu.
> > > >
> > > > but there is plan to support Qemu accepting a fd passed by Libvirt.
> > > > In that case Qemu even doesn't see the sysfs path hence is not
> > > > aware of pRID. otherwise yes we could leave the translation to
> > > > VMM instead.
> > >
> > > I think I misread Yi's narrative: dev_id is a working approach
> > > for VMM to convert to a vRID, while he is asking for a better
> > > alternative :)
> >
> > In concept, dev_id works, but in reality we have problem to get a dev_id
> > for a given device in intel iommu driver, hence I'm asking for help here. :)
> 
> Yea, I got that.
> 
> Would it be possible for us to postpone this error report in the
> vtd driver?
> 

that is the plan. We'll still report error code but no dev_id error
data. this is allowed by vtd spec.
  
Jason Gunthorpe Jan. 2, 2024, 5:46 p.m. UTC | #19
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:

> > - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> > maintained, then intel iommu can report a vRID back to user. But intel
> > iommu driver does not have viommu context, no place to hold the vRID-
> > >pRID
> > mapping. TBH. It may require other reasons to introduce it other than the
> > error reporting need. Anyhow, this requires more thinking and also has
> > dependency even if it is doable in intel side.
> 
> this sounds like a cleaner way to inject knowledge which iommu driver
> requires to find out the user tag. but yes it's a bit weird to introduce
> viommu awareness in intel iommu driver when there is no such thing
> in real hardware.
> 
> and for this error reporting case what we actually require is the
> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> and then store vRID under device_domain_info. a bit tricky on
> life cycle management and also incompatible with SIOV...
> 
> let's see whether Jason has a better idea here.

I think v10 is OK

struct iommu_hwpt_invalidate {
	__u32 size;
	__u32 hwpt_id;
	__aligned_u64 data_uptr;
	__u32 data_type;
	__u32 entry_len;
	__u32 entry_num;
	__u32 __reserved;
};

Sends the invalidation to the HWPT which matches what Intel wanted
where the entire HWPT and all its associated devices are
invalidated. No seperate per-device invalidation.

For error and event reporting they should be returned to userspace
with the IOMMU dev_id indicating the originating PCI function.

The VMM would have to convert dev_id into vRID according to the vIOMMU
instance that the device is hooked up.

In iommu land we should never have a "RID" but always some kind of
device-specific "device ID" which is the index into the particular HW
table, and that ID is naturally scoped to within the IOMMU instance
that owns the table - so it is very much not a global ID that can be
used alone in any of the uAPI.

The uAPI should use the iommufd device ID to refer to specific
devices.

Jason
  
Jason Gunthorpe Jan. 2, 2024, 11:38 p.m. UTC | #20
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
> 
> In concept, dev_id works, but in reality we have problem to get a dev_id
> for a given device in intel iommu driver, hence I'm asking for help here. :)

I think we just need to solve this one way or another.. Even if you
use a viommu object you still end up having difficult coupling to
iommufd

Some:
  iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)

Callable by a driver (using the driver-callable function
infrastructure we made for dirty tracking) Is really all that is
needed here.

Jason
  
Yi Liu Jan. 3, 2024, 2:24 a.m. UTC | #21
On 2024/1/3 07:38, Jason Gunthorpe wrote:
> On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
>>> I think I misread Yi's narrative: dev_id is a working approach
>>> for VMM to convert to a vRID, while he is asking for a better
>>> alternative :)
>>
>> In concept, dev_id works, but in reality we have problem to get a dev_id
>> for a given device in intel iommu driver, hence I'm asking for help here. :)
> 
> I think we just need to solve this one way or another.. Even if you
> use a viommu object you still end up having difficult coupling to
> iommufd
> 
> Some:
>    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> 
> Callable by a driver (using the driver-callable function
> infrastructure we made for dirty tracking) Is really all that is
> needed here.

yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when
IOMMUFD is configed. Maybe such an API could be compiled when
IOMMUFD_DRIVER is enabled as well. This does address my concern on making
intel iommu driver depending on iommufd. But still need a way to pass in
the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel
iommu driver could call the above iommufd_get_dev_id(). One possible way is
passing it when attaching device to domain and clear it in detach. However,
this seems not ideal as iommufd_ctx information should be static between
bind_iommufd and unbind. But we don't call into intel iommu driver in the
bind and unbind operations. May need to add new iommu op. Any suggestion
here?
  
Jason Gunthorpe Jan. 3, 2024, 4:01 p.m. UTC | #22
On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
> On 2024/1/3 07:38, Jason Gunthorpe wrote:
> > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > > > I think I misread Yi's narrative: dev_id is a working approach
> > > > for VMM to convert to a vRID, while he is asking for a better
> > > > alternative :)
> > > 
> > > In concept, dev_id works, but in reality we have problem to get a dev_id
> > > for a given device in intel iommu driver, hence I'm asking for help here. :)
> > 
> > I think we just need to solve this one way or another.. Even if you
> > use a viommu object you still end up having difficult coupling to
> > iommufd
> > 
> > Some:
> >    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> > 
> > Callable by a driver (using the driver-callable function
> > infrastructure we made for dirty tracking) Is really all that is
> > needed here.
> 
> yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when
> IOMMUFD is configed. Maybe such an API could be compiled when
> IOMMUFD_DRIVER is enabled as well. This does address my concern on making
> intel iommu driver depending on iommufd. But still need a way to pass in
> the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel
> iommu driver could call the above iommufd_get_dev_id(). One possible way is
> passing it when attaching device to domain and clear it in detach. However,
> this seems not ideal as iommufd_ctx information should be static between
> bind_iommufd and unbind. But we don't call into intel iommu driver in the
> bind and unbind operations. May need to add new iommu op. Any suggestion
> here?

You can pass the ctx to the invalidate op, it is already implied
because the passed iommu_domain is linked to a single iommufd ctx.

Jason
  
Nicolin Chen Jan. 3, 2024, 4:48 p.m. UTC | #23
On Wed, Jan 03, 2024 at 12:01:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
> > On 2024/1/3 07:38, Jason Gunthorpe wrote:
> > > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > > > > I think I misread Yi's narrative: dev_id is a working approach
> > > > > for VMM to convert to a vRID, while he is asking for a better
> > > > > alternative :)
> > > > 
> > > > In concept, dev_id works, but in reality we have problem to get a dev_id
> > > > for a given device in intel iommu driver, hence I'm asking for help here. :)
> > > 
> > > I think we just need to solve this one way or another.. Even if you
> > > use a viommu object you still end up having difficult coupling to
> > > iommufd
> > > 
> > > Some:
> > >    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> > > 
> > > Callable by a driver (using the driver-callable function
> > > infrastructure we made for dirty tracking) Is really all that is
> > > needed here.
> > 
> > yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when
> > IOMMUFD is configed. Maybe such an API could be compiled when
> > IOMMUFD_DRIVER is enabled as well. This does address my concern on making
> > intel iommu driver depending on iommufd. But still need a way to pass in
> > the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel
> > iommu driver could call the above iommufd_get_dev_id(). One possible way is
> > passing it when attaching device to domain and clear it in detach. However,
> > this seems not ideal as iommufd_ctx information should be static between
> > bind_iommufd and unbind. But we don't call into intel iommu driver in the
> > bind and unbind operations. May need to add new iommu op. Any suggestion
> > here?
> 
> You can pass the ctx to the invalidate op, it is already implied
> because the passed iommu_domain is linked to a single iommufd ctx.

The device virtual id lookup API needs something similar, yet it
likely needs a viommu pointer (or its id) instead? As the table
is attached to a viommu while an ictx can have multiple viommus,
right?

Thanks
Nic
  
Jason Gunthorpe Jan. 3, 2024, 4:58 p.m. UTC | #24
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > You can pass the ctx to the invalidate op, it is already implied
> > because the passed iommu_domain is linked to a single iommufd ctx.
> 
> The device virtual id lookup API needs something similar, yet it
> likely needs a viommu pointer (or its id) instead? As the table
> is attached to a viommu while an ictx can have multiple viommus,
> right?

Yes, when we get to an API for that it will have to be some op
'invalidate_viommu(..)' and it can get the necessary pointers.

The viommu object will have to be some driver object like the
iommu_domain.

Jason
  
Nicolin Chen Jan. 3, 2024, 5:06 p.m. UTC | #25
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > You can pass the ctx to the invalidate op, it is already implied
> > > because the passed iommu_domain is linked to a single iommufd ctx.
> > 
> > The device virtual id lookup API needs something similar, yet it
> > likely needs a viommu pointer (or its id) instead? As the table
> > is attached to a viommu while an ictx can have multiple viommus,
> > right?
> 
> Yes, when we get to an API for that it will have to be some op
> 'invalidate_viommu(..)' and it can get the necessary pointers.

OK! I will try that first.

> The viommu object will have to be some driver object like the
> iommu_domain.

I drafted something like this, linking it to struct iommu_device:

+struct iommufd_viommu {
+       struct iommufd_object obj;
+       struct iommufd_ctx *ictx;
+       struct iommu_device *iommu_dev;
+       struct iommufd_hwpt_paging *hwpt;
+       /* array of struct iommufd_device, indexed by device virtual id */
+       struct xarray device_ids;
+};

Thanks
Nicolin
  
Jason Gunthorpe Jan. 3, 2024, 5:52 p.m. UTC | #26
On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
> On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > > You can pass the ctx to the invalidate op, it is already implied
> > > > because the passed iommu_domain is linked to a single iommufd ctx.
> > > 
> > > The device virtual id lookup API needs something similar, yet it
> > > likely needs a viommu pointer (or its id) instead? As the table
> > > is attached to a viommu while an ictx can have multiple viommus,
> > > right?
> > 
> > Yes, when we get to an API for that it will have to be some op
> > 'invalidate_viommu(..)' and it can get the necessary pointers.
> 
> OK! I will try that first.
> 
> > The viommu object will have to be some driver object like the
> > iommu_domain.
> 
> I drafted something like this, linking it to struct iommu_device:
> 
> +struct iommufd_viommu {
> +       struct iommufd_object obj;
> +       struct iommufd_ctx *ictx;
> +       struct iommu_device *iommu_dev;
> +       struct iommufd_hwpt_paging *hwpt;
> +       /* array of struct iommufd_device, indexed by device virtual id */
> +       struct xarray device_ids;
> +};

The driver would have to create it and there would be some driver
specific enclosing struct to go with it

Perhaps device_ids goes in the driver specific struct, I don't know.

Not sure it should have hwpt at all, probably vmid should come from
the driver specific struct in some driver specific way

Jason
  
Nicolin Chen Jan. 3, 2024, 8:18 p.m. UTC | #27
On Wed, Jan 03, 2024 at 01:52:02PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
> > On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > > > You can pass the ctx to the invalidate op, it is already implied
> > > > > because the passed iommu_domain is linked to a single iommufd ctx.
> > > > 
> > > > The device virtual id lookup API needs something similar, yet it
> > > > likely needs a viommu pointer (or its id) instead? As the table
> > > > is attached to a viommu while an ictx can have multiple viommus,
> > > > right?
> > > 
> > > Yes, when we get to an API for that it will have to be some op
> > > 'invalidate_viommu(..)' and it can get the necessary pointers.
> > 
> > OK! I will try that first.
> > 
> > > The viommu object will have to be some driver object like the
> > > iommu_domain.
> > 
> > I drafted something like this, linking it to struct iommu_device:
> > 
> > +struct iommufd_viommu {
> > +       struct iommufd_object obj;
> > +       struct iommufd_ctx *ictx;
> > +       struct iommu_device *iommu_dev;
> > +       struct iommufd_hwpt_paging *hwpt;
> > +       /* array of struct iommufd_device, indexed by device virtual id */
> > +       struct xarray device_ids;
> > +};
> 
> The driver would have to create it and there would be some driver
> specific enclosing struct to go with it
> 
> Perhaps device_ids goes in the driver specific struct, I don't know.

+struct iommufd_viommu {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommu_device *iommu_dev;
+	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
+
+	int vmid;
+
+	union iommu_driver_user_data {
+		struct iommu_driver_user_vtd;
+		struct iommu_driver_user_arm_smmuv3;
+		struct iommu_driver_user_amd_viommu;
+	};
+};

Then iommu_ops would need something like:
	iommu_user_alloc/free(struct iommu_device *iommu_dev,
			      union *iommu_driver_user_data, int *vmid);
	iommu_user_set/unset_dev_id(union iommu_driver_user_data,
				    struct device* dev. u32/u64 id);
	iommu_user_invalidate(union iommu_driver_user_data *iommu,
			      struct iommu_user_data_array *array);

Comments and ideas on better naming convention?

> Not sure it should have hwpt at all, probably vmid should come from
> the driver specific struct in some driver specific way

The idea having a hwpt pointer is to share the paging hwpt among
the viommu objects. I don't think it "shouldn't have", yet likely
we can avoid it depending on whether it will have some use or not
in the context.

Nicolin
  
Jason Gunthorpe Jan. 4, 2024, 12:02 a.m. UTC | #28
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > The driver would have to create it and there would be some driver
> > specific enclosing struct to go with it
> > 
> > Perhaps device_ids goes in the driver specific struct, I don't know.
> 
> +struct iommufd_viommu {
> +	struct iommufd_object obj;
> +	struct iommufd_ctx *ictx;
> +	struct iommu_device *iommu_dev;
> +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> +
> +	int vmid;
> +
> +	union iommu_driver_user_data {
> +		struct iommu_driver_user_vtd;
> +		struct iommu_driver_user_arm_smmuv3;
> +		struct iommu_driver_user_amd_viommu;
> +	};

Not like that, in the usual container_of way

Jason
  
Jason Gunthorpe Jan. 4, 2024, 2:36 p.m. UTC | #29
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> Per the prior discussion[1], we agreed to move the error reporting into the
> driver specific part. On Intel side, we want to report two devTLB
> invalidation errors: ICE (invalid completion error) and ITE (invalidation
> timeout error). Such errors have an additional SID information to tell
> which device failed the devTLB invalidation. I've got the below structure.

IMHO all of this complexity is a consequence of the decision to hide
the devtlb invalidation from the VM..

On the other hand I guess you want to do this because of the SIOV
troubles where the vPCI function in the VM is entirely virtual and
can't be trivially mapped to a real PCI function for ATC invalidation
like ARM and AMD can do (but they also can't support SIOV because of
this). :(

However it also makes it very confusing about how the VM would
perceive an error - eg if it invalidates an SIOV device single PASID
and that devtlb fails then the error should be connected back to the
vPCI function for the SIOV's specific PASID and not back to the
physical PCI function for the SIOV owner.

As the iommu driver itself has no idea about the vPCI functions this
seems like it is going to get really confusing. The API I suggested in
the other email is not entirely going to work as the vPCI function for
SIOV cases will have to be identified by the (struct device, PASID) -
while it would be easy enough for the iommu driver to provide the
PASID, I'm not sure how the iommufd core will relate the PASID back to
the iommu_device to understand SIOV without actually being aware of
SIOV to some degree :\

(Given SIOVr1 seems on track to be replaced by SIOVr2 so this is all a
one-off I was hoping to minimize such awareness)

Jason
  
Tian, Kevin Jan. 5, 2024, 2:16 a.m. UTC | #30
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 4, 2024 10:37 PM
> 
> On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> > Per the prior discussion[1], we agreed to move the error reporting into the
> > driver specific part. On Intel side, we want to report two devTLB
> > invalidation errors: ICE (invalid completion error) and ITE (invalidation
> > timeout error). Such errors have an additional SID information to tell
> > which device failed the devTLB invalidation. I've got the below structure.
> 
> IMHO all of this complexity is a consequence of the decision to hide
> the devtlb invalidation from the VM..
> 
> On the other hand I guess you want to do this because of the SIOV
> troubles where the vPCI function in the VM is entirely virtual and
> can't be trivially mapped to a real PCI function for ATC invalidation
> like ARM and AMD can do (but they also can't support SIOV because of
> this). :(
> 
> However it also makes it very confusing about how the VM would
> perceive an error - eg if it invalidates an SIOV device single PASID
> and that devtlb fails then the error should be connected back to the
> vPCI function for the SIOV's specific PASID and not back to the
> physical PCI function for the SIOV owner.
> 
> As the iommu driver itself has no idea about the vPCI functions this
> seems like it is going to get really confusing. The API I suggested in
> the other email is not entirely going to work as the vPCI function for
> SIOV cases will have to be identified by the (struct device, PASID) -
> while it would be easy enough for the iommu driver to provide the
> PASID, I'm not sure how the iommufd core will relate the PASID back to
> the iommu_device to understand SIOV without actually being aware of
> SIOV to some degree :\

we plan to add such awareness with a new binding helper:

https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/

and with metadata to track association between PASID's and iommufd vdev.

but in reality the relation could be identified in an easy way due to a SIOV
restriction which we discussed before - shared PASID space of PF disallows
assigning sibling vdev's to a same VM (otherwise no way to identify which
sibling vdev triggering an iopf when a pasid is used on both vdev's). That
restriction implies that within an iommufd context every iommufd_device
object should contain a unique struct device pointer. So PASID can be
instead ignored in the lookup then just always do iommufd_get_dev_id()
using struct device.
  
Tian, Kevin Jan. 5, 2024, 2:52 a.m. UTC | #31
> From: Tian, Kevin
> Sent: Friday, January 5, 2024 10:16 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, January 4, 2024 10:37 PM
> >
> > On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> > > Per the prior discussion[1], we agreed to move the error reporting into
> the
> > > driver specific part. On Intel side, we want to report two devTLB
> > > invalidation errors: ICE (invalid completion error) and ITE (invalidation
> > > timeout error). Such errors have an additional SID information to tell
> > > which device failed the devTLB invalidation. I've got the below structure.
> >
> > IMHO all of this complexity is a consequence of the decision to hide
> > the devtlb invalidation from the VM..
> >
> > On the other hand I guess you want to do this because of the SIOV
> > troubles where the vPCI function in the VM is entirely virtual and
> > can't be trivially mapped to a real PCI function for ATC invalidation
> > like ARM and AMD can do (but they also can't support SIOV because of
> > this). :(
> >
> > However it also makes it very confusing about how the VM would
> > perceive an error - eg if it invalidates an SIOV device single PASID
> > and that devtlb fails then the error should be connected back to the
> > vPCI function for the SIOV's specific PASID and not back to the
> > physical PCI function for the SIOV owner.
> >
> > As the iommu driver itself has no idea about the vPCI functions this
> > seems like it is going to get really confusing. The API I suggested in
> > the other email is not entirely going to work as the vPCI function for
> > SIOV cases will have to be identified by the (struct device, PASID) -
> > while it would be easy enough for the iommu driver to provide the
> > PASID, I'm not sure how the iommufd core will relate the PASID back to
> > the iommu_device to understand SIOV without actually being aware of
> > SIOV to some degree :\
> 
> we plan to add such awareness with a new binding helper:
> 
> https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/
> 
> and with metadata to track association between PASID's and iommufd vdev.
> 
> but in reality the relation could be identified in an easy way due to a SIOV
> restriction which we discussed before - shared PASID space of PF disallows
> assigning sibling vdev's to a same VM (otherwise no way to identify which
> sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> restriction implies that within an iommufd context every iommufd_device
> object should contain a unique struct device pointer. So PASID can be
> instead ignored in the lookup then just always do iommufd_get_dev_id()
> using struct device.

A bit more background.

Previously we thought this restriction only applies to SIOV+vSVA, as
a guest process may bind to both sibling vdev's, leading to the same
pasid situation.

In concept w/o vSVA it's still possible to assign sibling vdev's to
a same VM as each vdev is allocated with a unique pasid to mark vRID
so can be differentiated from each other in the fault/error path.

But when looking at this err code issue with Yi closely, we found
there is another gap in the VT-d spec. Upon devtlb invalidation
timeout the hw doesn't report pasid in the error info register. this
makes it impossible to identify the source vdev if a hwpt invalidation
request involves sibling vdev's from a same PF.

with that I'm inclined to always imposing this restriction for SIOV. 
One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune
but I'm with you that given SIOVr1 is one-off I prefer to limiting its
usability other than complexing the kernel.
  
Nicolin Chen Jan. 5, 2024, 7:38 a.m. UTC | #32
On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > > The driver would have to create it and there would be some driver
> > > specific enclosing struct to go with it
> > > 
> > > Perhaps device_ids goes in the driver specific struct, I don't know.
> > 
> > +struct iommufd_viommu {
> > +	struct iommufd_object obj;
> > +	struct iommufd_ctx *ictx;
> > +	struct iommu_device *iommu_dev;
> > +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> > +
> > +	int vmid;
> > +
> > +	union iommu_driver_user_data {
> > +		struct iommu_driver_user_vtd;
> > +		struct iommu_driver_user_arm_smmuv3;
> > +		struct iommu_driver_user_amd_viommu;
> > +	};
> 
> Not like that, in the usual container_of way

How about this:

// iommu.h
@@ -490,6 +490,16 @@ struct iommu_ops {
+       /* User space instance allocation and freeing by the iommu driver */
+       struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu);
+       void (*iommu_free_user)(struct iommu_device_user *iommu_user);
+       int (*iommu_user_set_dev_id)(struct iommu_device_user *iommu_user,
+                                    struct device *dev, u64 dev_id);
+       int (*iommu_user_unset_dev_id)(struct iommu_device_user *iommu_user,
+                                      struct device *dev);
+       int (*iommu_cache_invalidate_usewr)(struct iommu_device_user *iommu_user,
+                                           struct iommu_user_data_array *array);
...
+/**
+ * struct iommu_device_user - IOMMU core representation of one IOMMU virtual
+ *                            instance
+ * @iommu_dev: Underlying IOMMU hardware instance
+ * @id: Virtual instance ID, e.g. a vmid
+ */
+struct iommu_device_user {
+       struct iommu_device *iommu_dev;
+       struct xarray virtual_ids;
+       u32 id;
+};

// iommufd_private.h
+struct iommufd_viommu {
+       struct iommufd_object obj;
+       struct iommufd_ctx *ictx;
+       struct iommu_device *iommu_dev;
+       struct iommu_device_user *iommu_user;
+       struct iommufd_hwpt_paging *hwpt;
+};
+
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_viommu_destroy(struct iommufd_object *obj);

So iommu_alloc_user op allocates a driver private structure that
contains an iommu_device_user and returns &priv->iommu_user.

The set/unset_dev_id ops probably need a type argument if there
can be a multi-xarray case, then the virtual_ids xarray should
be moved to the driver private structure too?

Also, a 64-bit virtual id in the uAPI was suggested previously.
But xarray is limited to a 32-bit index? Should we compromise
the uAPI to 32-bit or is there an alternative for a 64-bit id
lookup?

Thanks
Nicolin
  
Jason Gunthorpe Jan. 5, 2024, 2:45 p.m. UTC | #33
On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
> > but in reality the relation could be identified in an easy way due to a SIOV
> > restriction which we discussed before - shared PASID space of PF disallows
> > assigning sibling vdev's to a same VM (otherwise no way to identify which
> > sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> > restriction implies that within an iommufd context every iommufd_device
> > object should contain a unique struct device pointer. So PASID can be
> > instead ignored in the lookup then just always do iommufd_get_dev_id()
> > using struct device.
> 
> A bit more background.
> 
> Previously we thought this restriction only applies to SIOV+vSVA, as
> a guest process may bind to both sibling vdev's, leading to the same
> pasid situation.
> 
> In concept w/o vSVA it's still possible to assign sibling vdev's to
> a same VM as each vdev is allocated with a unique pasid to mark vRID
> so can be differentiated from each other in the fault/error path.

I thought the SIOV plan was that each "vdev" ie vpci function would
get a slice of the pRID's PASID space statically selected at creation?

So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
pPASID into each VM.

From that view you can't identify the iommufd dev_id without knowing
both the pRID and pPASID which will disambiguate the different SIOV
iommufd dev_id instances sharing a rid.

> But when looking at this err code issue with Yi closely, we found
> there is another gap in the VT-d spec. Upon devtlb invalidation
> timeout the hw doesn't report pasid in the error info register. this
> makes it impossible to identify the source vdev if a hwpt invalidation
> request involves sibling vdev's from a same PF.

Don't you know which command timed out?
 
> with that I'm inclined to always imposing this restriction for SIOV. 
> One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune
> but I'm with you that given SIOVr1 is one-off I prefer to limiting its
> usability other than complexing the kernel.

By this you mean give up on SIOV entirely and always assign the full
pRID to an entire VM? I'm confused what restriction you mean if you
can't rely on the PASID?

Jason
  
Jason Gunthorpe Jan. 5, 2024, 3:46 p.m. UTC | #34
On Thu, Jan 04, 2024 at 11:38:40PM -0800, Nicolin Chen wrote:
> On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > > > The driver would have to create it and there would be some driver
> > > > specific enclosing struct to go with it
> > > > 
> > > > Perhaps device_ids goes in the driver specific struct, I don't know.
> > > 
> > > +struct iommufd_viommu {
> > > +	struct iommufd_object obj;
> > > +	struct iommufd_ctx *ictx;
> > > +	struct iommu_device *iommu_dev;
> > > +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> > > +
> > > +	int vmid;
> > > +
> > > +	union iommu_driver_user_data {
> > > +		struct iommu_driver_user_vtd;
> > > +		struct iommu_driver_user_arm_smmuv3;
> > > +		struct iommu_driver_user_amd_viommu;
> > > +	};
> > 
> > Not like that, in the usual container_of way
> 
> How about this:
> 
> // iommu.h
> @@ -490,6 +490,16 @@ struct iommu_ops {
> +       /* User space instance allocation and freeing by the iommu driver */
> +       struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu);

struct iommufd_viommu *iommu_alloc_viommu(struct device *dev);

> +/**
> + * struct iommu_device_user - IOMMU core representation of one IOMMU virtual
> + *                            instance
> + * @iommu_dev: Underlying IOMMU hardware instance
> + * @id: Virtual instance ID, e.g. a vmid
> + */
> +struct iommu_device_user {
> +       struct iommu_device *iommu_dev;
> +       struct xarray virtual_ids;
> +       u32 id;
> +};
> 
> // iommufd_private.h
> +struct iommufd_viommu {
> +       struct iommufd_object obj;
> +       struct iommufd_ctx *ictx;
> +       struct iommu_device *iommu_dev;
> +       struct iommu_device_user *iommu_user;
> +       struct iommufd_hwpt_paging *hwpt;
> +};

And you probably don't need two structs, just combine them together

And don't repeat the iommu_domain misdesign, there should be some
helper to init (or maybe allocate and init) the core structure along
with the driver part.

> The set/unset_dev_id ops probably need a type argument if there
> can be a multi-xarray case, then the virtual_ids xarray should
> be moved to the driver private structure too?

Yeah, probably. No need to add things that are not used right now, but
it does make some sense that the driver could control the
datastructure used for mapping. eg AMD has a HW backed datastructure
so they may not need an xarray.

> Also, a 64-bit virtual id in the uAPI was suggested previously.
> But xarray is limited to a 32-bit index? Should we compromise
> the uAPI to 32-bit or is there an alternative for a 64-bit id
> lookup?

You can use 64 bits in the uapi and forbid values that are too
large. xarry uses an unsigned long for the index so it is 64 bit on
64 bit systems.

Jason
  
Tian, Kevin Jan. 8, 2024, 4:07 a.m. UTC | #35
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, January 5, 2024 10:45 PM
> 
> On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
> > > but in reality the relation could be identified in an easy way due to a SIOV
> > > restriction which we discussed before - shared PASID space of PF
> disallows
> > > assigning sibling vdev's to a same VM (otherwise no way to identify which
> > > sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> > > restriction implies that within an iommufd context every iommufd_device
> > > object should contain a unique struct device pointer. So PASID can be
> > > instead ignored in the lookup then just always do iommufd_get_dev_id()
> > > using struct device.
> >
> > A bit more background.
> >
> > Previously we thought this restriction only applies to SIOV+vSVA, as
> > a guest process may bind to both sibling vdev's, leading to the same
> > pasid situation.
> >
> > In concept w/o vSVA it's still possible to assign sibling vdev's to
> > a same VM as each vdev is allocated with a unique pasid to mark vRID
> > so can be differentiated from each other in the fault/error path.
> 
> I thought the SIOV plan was that each "vdev" ie vpci function would
> get a slice of the pRID's PASID space statically selected at creation?
> 
> So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
> pPASID into each VM.
> 
> From that view you can't identify the iommufd dev_id without knowing
> both the pRID and pPASID which will disambiguate the different SIOV
> iommufd dev_id instances sharing a rid.

true when assigning those instances to different VMs.

Here I was talking about assigning them to a same VM being a problem.
with rid sharing plus same ENQCMD pPASID potentially used on both
instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.

we agreed before on preventing sibling vdev's in one VM for above
reason, but only as far as vSVA is concerned.

then given the new finding in err reporting I wondered whether this
restriction should be applied to all SIOV scenarios (but irrelevant now
with below explanation after another thinking)

> 
> > But when looking at this err code issue with Yi closely, we found
> > there is another gap in the VT-d spec. Upon devtlb invalidation
> > timeout the hw doesn't report pasid in the error info register. this
> > makes it impossible to identify the source vdev if a hwpt invalidation
> > request involves sibling vdev's from a same PF.
> 
> Don't you know which command timed out?

unfortunately no.

for errors related to descriptor fetch the driver can tell the command
by looking at the head pointer of the invalidation queue.

command completion is indirectly detected by inserting a wait descriptor
as fence. completion timeout error is reported in an error register. but
this register doesn't record pasid, nor does the command location. if there
are multiple pending devtlb invalidation commands upon timeout 
error the spec suggests the driver to treat all of them timeout as the
register can only record one rid.

this is kind of moot. If the driver submits only one command (plus wait)
at a time it doesn't need hw's help to identify the timeout command. 
If the driver batches invalidation commands it must treat all timeout if
an timeout error is reported.

from this angle whether to record pasid doesn't really matter.

intel-iommu driver doesn't batch commands. so it's possible for
the driver to figure out the timeout device itself and identify rid plus
pasid to find dev_id from iommufd.

Thanks
Kevin
  
Jason Gunthorpe Jan. 8, 2024, 1:51 p.m. UTC | #36
On Mon, Jan 08, 2024 at 04:07:12AM +0000, Tian, Kevin wrote:
> > > In concept w/o vSVA it's still possible to assign sibling vdev's to
> > > a same VM as each vdev is allocated with a unique pasid to mark vRID
> > > so can be differentiated from each other in the fault/error path.
> > 
> > I thought the SIOV plan was that each "vdev" ie vpci function would
> > get a slice of the pRID's PASID space statically selected at creation?
> > 
> > So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
> > pPASID into each VM.
> > 
> > From that view you can't identify the iommufd dev_id without knowing
> > both the pRID and pPASID which will disambiguate the different SIOV
> > iommufd dev_id instances sharing a rid.
> 
> true when assigning those instances to different VMs.
> 
> Here I was talking about assigning them to a same VM being a problem.
> with rid sharing plus same ENQCMD pPASID potentially used on both
> instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.

Oh you imaging sharing the pPASID if things have the same translation?
I guess I can see why, but given where things are overall I'd say just
don't do that.

Indeed we can't do that because it makes the vRID unknowable.

(again I continue to think that vt-d cache design is messed up, using
the PASID for the cache tag is a *terrible* design, and causes exactly
these kinds of problems)

> for errors related to descriptor fetch the driver can tell the command
> by looking at the head pointer of the invalidation queue.
> 
> command completion is indirectly detected by inserting a wait descriptor
> as fence. completion timeout error is reported in an error register. but
> this register doesn't record pasid, nor does the command location. if there
> are multiple pending devtlb invalidation commands upon timeout 
> error the spec suggests the driver to treat all of them timeout as the
> register can only record one rid.

Makes sense, or at least you have to re-issue them one by one

> this is kind of moot. If the driver submits only one command (plus wait)
> at a time it doesn't need hw's help to identify the timeout command. 
> If the driver batches invalidation commands it must treat all timeout if
> an timeout error is reported.

Yes
 
> from this angle whether to record pasid doesn't really matter.

At least for error handling..
 
Jason
  
Yi Liu Jan. 9, 2024, 6 a.m. UTC | #37
On 2024/1/8 12:07, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, January 5, 2024 10:45 PM
>>
>> On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
>>>> but in reality the relation could be identified in an easy way due to a SIOV
>>>> restriction which we discussed before - shared PASID space of PF
>> disallows
>>>> assigning sibling vdev's to a same VM (otherwise no way to identify which
>>>> sibling vdev triggering an iopf when a pasid is used on both vdev's). That
>>>> restriction implies that within an iommufd context every iommufd_device
>>>> object should contain a unique struct device pointer. So PASID can be
>>>> instead ignored in the lookup then just always do iommufd_get_dev_id()
>>>> using struct device.
>>>
>>> A bit more background.
>>>
>>> Previously we thought this restriction only applies to SIOV+vSVA, as
>>> a guest process may bind to both sibling vdev's, leading to the same
>>> pasid situation.
>>>
>>> In concept w/o vSVA it's still possible to assign sibling vdev's to
>>> a same VM as each vdev is allocated with a unique pasid to mark vRID
>>> so can be differentiated from each other in the fault/error path.
>>
>> I thought the SIOV plan was that each "vdev" ie vpci function would
>> get a slice of the pRID's PASID space statically selected at creation?
>>
>> So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
>> pPASID into each VM.
>>
>>  From that view you can't identify the iommufd dev_id without knowing
>> both the pRID and pPASID which will disambiguate the different SIOV
>> iommufd dev_id instances sharing a rid.
> 
> true when assigning those instances to different VMs.
> 
> Here I was talking about assigning them to a same VM being a problem.
> with rid sharing plus same ENQCMD pPASID potentially used on both
> instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.
> 
> we agreed before on preventing sibling vdev's in one VM for above
> reason, but only as far as vSVA is concerned.
> 
> then given the new finding in err reporting I wondered whether this
> restriction should be applied to all SIOV scenarios (but irrelevant now
> with below explanation after another thinking)
> 
>>
>>> But when looking at this err code issue with Yi closely, we found
>>> there is another gap in the VT-d spec. Upon devtlb invalidation
>>> timeout the hw doesn't report pasid in the error info register. this
>>> makes it impossible to identify the source vdev if a hwpt invalidation
>>> request involves sibling vdev's from a same PF.
>>
>> Don't you know which command timed out?
> 
> unfortunately no.
> 
> for errors related to descriptor fetch the driver can tell the command
> by looking at the head pointer of the invalidation queue.
> 
> command completion is indirectly detected by inserting a wait descriptor
> as fence. completion timeout error is reported in an error register. but
> this register doesn't record pasid, nor does the command location. if there
> are multiple pending devtlb invalidation commands upon timeout
> error the spec suggests the driver to treat all of them timeout as the
> register can only record one rid.
> 
> this is kind of moot. If the driver submits only one command (plus wait)
> at a time it doesn't need hw's help to identify the timeout command.
> If the driver batches invalidation commands it must treat all timeout if
> an timeout error is reported.
> 
> from this angle whether to record pasid doesn't really matter.
> 
> intel-iommu driver doesn't batch commands. so it's possible for
> the driver to figure out the timeout device itself and identify rid plus
> pasid to find dev_id from iommufd.

based on this, even RID is unnecessary. Software should know which device
it has sent a devTLB invalidation.
  

Patch

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 7f92cecc87d7..cafd98642abf 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -614,6 +614,42 @@  struct iommu_hwpt_get_dirty_bitmap {
 #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
 					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
 
+/**
+ * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
+ *                                           stage-1 cache invalidation
+ * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
+ *                            leaf PTE caching needs to be invalidated
+ *                            and other paging structure caches can be
+ *                            preserved.
+ */
+enum iommu_hwpt_vtd_s1_invalidate_flags {
+	IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *                                       (IOMMU_HWPT_DATA_VTD_S1)
+ * @addr: The start address of the addresses to be invalidated. It needs
+ *        to be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be __aligned_u64(-1). This includes the
+ * corresponding device-TLB if ATS is enabled on the attached devices.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+	__aligned_u64 addr;
+	__aligned_u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)