[rc,3/8] iommu/vt-d: Add missing iotlb flush for parent domain

Message ID 20240208082307.15759-4-yi.l.liu@intel.com
State New
Headers
Series Add missing cache flush and dirty tracking set for nested parent domain |

Commit Message

Yi Liu Feb. 8, 2024, 8:23 a.m. UTC
  If a domain is used as the parent in nested translation its mappings might
be cached using DID of the nested domain. But the existing code ignores
this fact to only invalidate the iotlb entries tagged by the domain's own
DID.

Loop the s1_domains list, if any, to invalidate all iotlb entries related
to the target s2 address range. According to VT-d spec there is no need for
software to explicitly flush the affected s1 cache. It's implicitly done by
HW when s2 cache is invalidated.

Fixes: b41e38e22539 ("iommu/vt-d: Add nested domain allocation")
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/intel/iommu.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
  

Comments

Tian, Kevin Feb. 8, 2024, 8:38 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, February 8, 2024 4:23 PM
> 
> +/*
> + * Flush the relevant caches in nested translation if the domain
> + * also serves as a parent
> + */
> +static void parent_domain_flush(struct dmar_domain *domain,
> +				unsigned long pfn,
> +				unsigned long pages, int ih)
> +{
> +	struct dmar_domain *s1_domain;
> +
> +	spin_lock(&domain->s1_lock);
> +	list_for_each_entry(s1_domain, &domain->s1_domains, s2_link) {
> +		struct iommu_domain_info *info;
> +		unsigned long i;
> +
> +		xa_for_each(&s1_domain->iommu_array, i, info)
> +			__iommu_flush_iotlb_psi(info->iommu, info->did,
> +						pfn, pages, ih);
> +	}

As Jason suggested before this xarray lacks of proper locking.

but given it's rc fix I'm fine with it. @Baolu we do need fix it soon so
this problem won't further expand like here.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
  
Baolu Lu Feb. 9, 2024, 2:40 a.m. UTC | #2
On 2024/2/8 16:38, Tian, Kevin wrote:
>> From: Liu, Yi L<yi.l.liu@intel.com>
>> Sent: Thursday, February 8, 2024 4:23 PM
>>
>> +/*
>> + * Flush the relevant caches in nested translation if the domain
>> + * also serves as a parent
>> + */
>> +static void parent_domain_flush(struct dmar_domain *domain,
>> +				unsigned long pfn,
>> +				unsigned long pages, int ih)
>> +{
>> +	struct dmar_domain *s1_domain;
>> +
>> +	spin_lock(&domain->s1_lock);
>> +	list_for_each_entry(s1_domain, &domain->s1_domains, s2_link) {
>> +		struct iommu_domain_info *info;
>> +		unsigned long i;
>> +
>> +		xa_for_each(&s1_domain->iommu_array, i, info)
>> +			__iommu_flush_iotlb_psi(info->iommu, info->did,
>> +						pfn, pages, ih);
>> +	}
> As Jason suggested before this xarray lacks of proper locking.
> 
> but given it's rc fix I'm fine with it. @Baolu we do need fix it soon so
> this problem won't further expand like here.

Yes. I planned to fix the locking issue in a separated series.

Best regards,
baolu
  
Jason Gunthorpe Feb. 21, 2024, 3:19 p.m. UTC | #3
On Thu, Feb 08, 2024 at 12:23:02AM -0800, Yi Liu wrote:
> If a domain is used as the parent in nested translation its mappings might
> be cached using DID of the nested domain. But the existing code ignores
> this fact to only invalidate the iotlb entries tagged by the domain's own
> DID.

> Loop the s1_domains list, if any, to invalidate all iotlb entries related
> to the target s2 address range. According to VT-d spec there is no need for
> software to explicitly flush the affected s1 cache. It's implicitly done by
> HW when s2 cache is invalidated.

I had to look this up to understand what it means.. The HW caches
per-DID and if you invalidate the DID's S2 then the HW flushes the S1
as well within that DID only.

It doesn't mean that the S2 is globally shared across all the nesting
translations (like ARM does), and you still have to iterate over every
nesting DID.

In light of that this design seems to have gone a bit off..

A domain should have a list of places that need invalidation,
specifically a list of DIDs and ATCs that need an invalidation to be
issued.

Instead we now somehow have 4 different lists in the domain the
invalidation code iterates over?

So I would think this:

struct dmar_domain {
	struct xarray iommu_array;	/* Attached IOMMU array */
	struct list_head devices;	/* all devices' list */
	struct list_head dev_pasids;	/* all attached pasids */
	struct list_head s1_domains;

Would make sense to be collapsed into one logical list of attached
devices:

struct intel_iommu_domain_attachment {
   unsigned int did;
   ioasid_t pasid;
   struct device_domain_info *info;
   list_head item;
};

When you attach a S1/S2 nest you allocate two of the above structs and
one is linked on the S1 and one is linked on the S2..

Jason
  
Yi Liu Feb. 22, 2024, 8:34 a.m. UTC | #4
On 2024/2/21 23:19, Jason Gunthorpe wrote:
> On Thu, Feb 08, 2024 at 12:23:02AM -0800, Yi Liu wrote:
>> If a domain is used as the parent in nested translation its mappings might
>> be cached using DID of the nested domain. But the existing code ignores
>> this fact to only invalidate the iotlb entries tagged by the domain's own
>> DID.
> 
>> Loop the s1_domains list, if any, to invalidate all iotlb entries related
>> to the target s2 address range. According to VT-d spec there is no need for
>> software to explicitly flush the affected s1 cache. It's implicitly done by
>> HW when s2 cache is invalidated.
> 
> I had to look this up to understand what it means.. The HW caches
> per-DID and if you invalidate the DID's S2 then the HW flushes the S1
> as well within that DID only.

yes.

> 
> It doesn't mean that the S2 is globally shared across all the nesting
> translations (like ARM does), and you still have to iterate over every
> nesting DID.
> 
> In light of that this design seems to have gone a bit off..
> 
> A domain should have a list of places that need invalidation,
> specifically a list of DIDs and ATCs that need an invalidation to be
> issued.
> 
> Instead we now somehow have 4 different lists in the domain the
> invalidation code iterates over?
> 
> So I would think this:
> 
> struct dmar_domain {
> 	struct xarray iommu_array;	/* Attached IOMMU array */
> 	struct list_head devices;	/* all devices' list */
> 	struct list_head dev_pasids;	/* all attached pasids */
> 	struct list_head s1_domains;
> 
> Would make sense to be collapsed into one logical list of attached
> devices:
> 
> struct intel_iommu_domain_attachment {
>     unsigned int did;
>     ioasid_t pasid;
>     struct device_domain_info *info;
>     list_head item;
> };
> 
> When you attach a S1/S2 nest you allocate two of the above structs and
> one is linked on the S1 and one is linked on the S2..

yes, this shall work. ATC flushing should be fine. But there can be a
drawback when flushing DIDs. VT-d side, if there are multiple devices
behind the same iommu unit, just need one DID flushing is enough. But
after the above change, the number of DID flushing would increase per
the number of devices. Although no functional issue, but it submits
duplicated invalidation.
  
Jason Gunthorpe Feb. 22, 2024, 3:16 p.m. UTC | #5
On Thu, Feb 22, 2024 at 04:34:10PM +0800, Yi Liu wrote:
> > It doesn't mean that the S2 is globally shared across all the nesting
> > translations (like ARM does), and you still have to iterate over every
> > nesting DID.
> > 
> > In light of that this design seems to have gone a bit off..
> > 
> > A domain should have a list of places that need invalidation,
> > specifically a list of DIDs and ATCs that need an invalidation to be
> > issued.
> > 
> > Instead we now somehow have 4 different lists in the domain the
> > invalidation code iterates over?
> > 
> > So I would think this:
> > 
> > struct dmar_domain {
> > 	struct xarray iommu_array;	/* Attached IOMMU array */
> > 	struct list_head devices;	/* all devices' list */
> > 	struct list_head dev_pasids;	/* all attached pasids */
> > 	struct list_head s1_domains;
> > 
> > Would make sense to be collapsed into one logical list of attached
> > devices:
> > 
> > struct intel_iommu_domain_attachment {
> >     unsigned int did;
> >     ioasid_t pasid;
> >     struct device_domain_info *info;
> >     list_head item;
> > };
> > 
> > When you attach a S1/S2 nest you allocate two of the above structs and
> > one is linked on the S1 and one is linked on the S2..
> 
> yes, this shall work. ATC flushing should be fine. But there can be a
> drawback when flushing DIDs. VT-d side, if there are multiple devices
> behind the same iommu unit, just need one DID flushing is enough. But
> after the above change, the number of DID flushing would increase per
> the number of devices. Although no functional issue, but it submits
> duplicated invalidation.

At least the three server drivers all have this same problem, I would
like to seem some core code to help solve it, since it is tricky and
the RCU idea is good..

Collapsing invalidations can be done by sorting, I think this was
Robin's suggestion.  But we could also easially maintain multiple list
threads hidden inside the API, or allocate a multi-level list.

Something very approximately like this:

Jason

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d14413916f93a0..7b2de139e7c437 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -38,6 +38,8 @@
 
 #include "iommu-sva.h"
 
+#include <linux/iommu-alist.h>
+
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
 static DEFINE_IDA(iommu_global_pasid_ida);
diff --git a/include/linux/iommu-alist.h b/include/linux/iommu-alist.h
new file mode 100644
index 00000000000000..7a9243f8b2b5f8
--- /dev/null
+++ b/include/linux/iommu-alist.h
@@ -0,0 +1,126 @@
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/iommu.h>
+
+/*
+ * Place in an iommu_domain to keep track of what devices have been attached to
+ * it.
+ */
+struct iommu_attach_list {
+	spinlock_t lock;
+	struct list_head all;
+};
+
+/*
+ * Allocate for every time a device is attached to a domain
+ */
+struct iommu_attach_elm {
+	/*
+	 * Note that this pointer is accessed under RCU, the driver has to be
+	 * careful to rcu free it.
+	 */
+	struct iommu_device *iommu_device;
+	ioasid_t pasid;
+	u8 using_atc;
+	struct list_head all_item;
+
+	/* May not be accessed under RCU! */
+	struct device *device;
+};
+
+void iommu_attach_init(struct iommu_attach_list *alist)
+{
+	spin_lock_init(&alist->lock);
+}
+
+int iommu_attach_add(struct iommu_attach_list *alist,
+		     struct iommu_attach_elm *elm)
+{
+	struct list_head *insert_after;
+
+	spin_lock(&alist->lock);
+	insert_after = list_find_sorted_insertion(alist, elm, cmp);
+	list_add_rcu(&elm->all_item, insert_after);
+	spin_unlock(&alist->lock);
+}
+
+void iommu_attach_del_free(struct iommu_attach_list *alist, struct iommu_attach_elm *elm)
+{
+	spin_lock(&alist->lock);
+	list_del_rcu(&elm->all_item);
+	spin_unlock(&alist->lock);
+	/* assumes 0 offset */
+	kfree_rcu_mightsleep(elm);
+}
+
+static struct iommu_attach_elm *
+__iommu_attach_next_iommu(struct iommu_attach_list *alist,
+			  struct iommu_attach_elm *pos,
+			  struct iommu_attach_elm *last)
+{
+	struct iommu_attach_elm *next;
+
+	do {
+		next = list_next_or_null_rcu(&alist->all, &pos->all_item,
+					     struct iommu_attach_elm, all_item);
+		if (!next)
+			return NULL;
+		if (!last)
+			return next;
+	} while (pos->iommu_device == next->iommu_device);
+	return next;
+}
+
+/* assumes 0 offset */
+#define iommu_attach_next_iommu(alist, pos, last, member)             \
+	container_of(__iommu_attach_next_iommu(alist, &(pos)->member, \
+					       &(last)->member),      \
+		     typeof(*pos), member)
+
+#define iommu_attach_for_each_iommu(pos, last, alist, member)              \
+	for (({                                                            \
+		     RCU_LOCKDEP_WARN(                                     \
+			     !rcu_read_lock_any_held(),                    \
+			     "RCU-list traversed in non-reader section!"); \
+	     }),                                                           \
+	     pos = list_first_or_null_rcu(&(alist)->all, typeof(*pos),     \
+					  member.all_item),                \
+	     last = NULL;                                                  \
+	     pos; pos = iommu_attach_next_iommu(alist, pos, last, member))
+
+/* Example */
+struct driver_domain {
+	struct iommu_domain domain;
+	struct iommu_attach_list alist;
+};
+
+struct driver_attach_elm {
+	struct iommu_attach_elm aelm;
+	unsigned int cache_tag;
+};
+
+static void example(struct driver_domain *domain)
+{
+	struct driver_attach_elm *elm;
+	struct driver_attach_elm *pos, *last;
+	bool need_atc;
+
+	iommu_attach_init(&domain->alist);
+
+	elm = kzalloc(sizeof(*elm), GFP_KERNEL);
+	iommu_attach_add(&domain->alist, &elm->aelm);
+
+	rcu_read_lock();
+	iommu_attach_for_each_iommu(pos, last, &domain->alist, aelm) {
+		invalidate_iommu(elm);
+		need_atc |= elm->aelm.using_atc;
+	}
+	if (need_atc) {
+		list_for_each_entry_rcu(pos, &domain->alist.all,
+					aelm.all_item) {
+			if (pos->aelm.using_atc)
+				invalidate_atc(elm);
+		}
+	}
+	rcu_read_unlock();
+}
  

Patch

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index eef6a187b651..46174ff5ae22 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1452,6 +1452,28 @@  static void __mapping_notify_one(struct intel_iommu *iommu, struct dmar_domain *
 		iommu_flush_write_buffer(iommu);
 }
 
+/*
+ * Flush the relevant caches in nested translation if the domain
+ * also serves as a parent
+ */
+static void parent_domain_flush(struct dmar_domain *domain,
+				unsigned long pfn,
+				unsigned long pages, int ih)
+{
+	struct dmar_domain *s1_domain;
+
+	spin_lock(&domain->s1_lock);
+	list_for_each_entry(s1_domain, &domain->s1_domains, s2_link) {
+		struct iommu_domain_info *info;
+		unsigned long i;
+
+		xa_for_each(&s1_domain->iommu_array, i, info)
+			__iommu_flush_iotlb_psi(info->iommu, info->did,
+						pfn, pages, ih);
+	}
+	spin_unlock(&domain->s1_lock);
+}
+
 static void intel_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
@@ -1471,6 +1493,9 @@  static void intel_flush_iotlb_all(struct iommu_domain *domain)
 		if (!cap_caching_mode(iommu->cap))
 			iommu_flush_dev_iotlb(dmar_domain, 0, MAX_AGAW_PFN_WIDTH);
 	}
+
+	if (dmar_domain->nested_parent)
+		parent_domain_flush(dmar_domain, 0, -1, 0);
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -1994,6 +2019,9 @@  static void switch_to_super_page(struct dmar_domain *domain,
 				iommu_flush_iotlb_psi(info->iommu, domain,
 						      start_pfn, lvl_pages,
 						      0, 0);
+			if (domain->nested_parent)
+				parent_domain_flush(domain, start_pfn,
+						    lvl_pages, 0);
 		}
 
 		pte++;
@@ -4126,6 +4154,9 @@  static void intel_iommu_tlb_sync(struct iommu_domain *domain,
 				      start_pfn, nrpages,
 				      list_empty(&gather->freelist), 0);
 
+	if (dmar_domain->nested_parent)
+		parent_domain_flush(dmar_domain, start_pfn, nrpages,
+				    list_empty(&gather->freelist));
 	put_pages_list(&gather->freelist);
 }