[v2,1/2] iommu/virtio: Make use of ops->iotlb_sync_map

Message ID 20230918-viommu-sync-map-v2-1-f33767f6cf7a@linux.ibm.com
State New
Headers
Series iommu/virtio: Enable IOMMU_CAP_DERRED_FLUSH |

Commit Message

Niklas Schnelle Sept. 18, 2023, 11:51 a.m. UTC
  Pull out the sync operation from viommu_map_pages() by implementing
ops->iotlb_sync_map. This allows the common IOMMU code to map multiple
elements of an sg with a single sync (see iommu_map_sg()). Furthermore,
it is also a requirement for IOMMU_CAP_DEFERRED_FLUSH.

Link: https://lore.kernel.org/lkml/20230726111433.1105665-1-schnelle@linux.ibm.com/
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/virtio-iommu.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)
  

Comments

Niklas Schnelle Sept. 19, 2023, 8 a.m. UTC | #1
On Mon, 2023-09-18 at 17:37 +0100, Robin Murphy wrote:
> On 2023-09-18 12:51, Niklas Schnelle wrote:
> > Pull out the sync operation from viommu_map_pages() by implementing
> > ops->iotlb_sync_map. This allows the common IOMMU code to map multiple
> > elements of an sg with a single sync (see iommu_map_sg()). Furthermore,
> > it is also a requirement for IOMMU_CAP_DEFERRED_FLUSH.
> 
> Is it really a requirement? Deferred flush only deals with unmapping. Or 
> are you just trying to say that it's not too worthwhile to try doing 
> more for unmapping performance while obvious mapping performance is 
> still left on the table?
> 

You're right there is no hard requirement. I somehow thought that
iommu_create_device_direct_map() relied on it because it does
flush_iotlb_all() and iommu_map() but that doesn't seem to be true. If
you want I can resend with the last sentence removed.

> > Link: https://lore.kernel.org/lkml/20230726111433.1105665-1-schnelle@linux.ibm.com/
> > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > ---
> >   drivers/iommu/virtio-iommu.c | 17 ++++++++++++++++-
> >   1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > index 17dcd826f5c2..3649586f0e5c 100644
> > --- a/drivers/iommu/virtio-iommu.c
> > +++ b/drivers/iommu/virtio-iommu.c
> > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> >   	int ret;
> >   	unsigned long flags;
> >   
> > +	/*
> > +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > +	 * is initialized e.g. via iommu_create_device_direct_mappings()
> > +	 */
> > +	if (!viommu)
> > +		return 0;
> 
> Minor nit: I'd be inclined to make that check explicitly in the places 
> where it definitely is expected, rather than allowing *any* sync to 
> silently do nothing if called incorrectly. Plus then they could use 
> vdomain->nr_endpoints for consistency with the equivalent checks 
> elsewhere (it did take me a moment to figure out how we could get to 
> .iotlb_sync_map with a NULL viommu without viommu_map_pages() blowing up 
> first...)
> 
> Thanks,
> Robin.

That's what I had in v1. I think this is a matter of taste and Jean-
Philippe pointed me to moving the check into viommu_sync_req() I added
a comment because it really is not entirely obvious.

> 
> >   	spin_lock_irqsave(&viommu->request_lock, flags);
> >   	ret = __viommu_sync_req(viommu);
> >   	if (ret)
> > @@ -843,7 +849,7 @@ static int viommu_map_pages(struct iommu_domain *domain, unsigned long iova,
> >   			.flags		= cpu_to_le32(flags),
> >   		};
> >   
> > -		ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map));
> > +		ret = viommu_add_req(vdomain->viommu, &map, sizeof(map));
> >   		if (ret) {
> >   			viommu_del_mappings(vdomain, iova, end);
> >   			return ret;
> > @@ -912,6 +918,14 @@ static void viommu_iotlb_sync(struct iommu_domain *domain,
> >   	viommu_sync_req(vdomain->viommu);
> >   }
> >   
> > +static int viommu_iotlb_sync_map(struct iommu_domain *domain,
> > +				 unsigned long iova, size_t size)
> > +{
> > +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> > +
> > +	return viommu_sync_req(vdomain->viommu);
> > +}
> > +
> >   static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
> >   {
> >   	struct iommu_resv_region *entry, *new_entry, *msi = NULL;
> > @@ -1058,6 +1072,7 @@ static struct iommu_ops viommu_ops = {
> >   		.unmap_pages		= viommu_unmap_pages,
> >   		.iova_to_phys		= viommu_iova_to_phys,
> >   		.iotlb_sync		= viommu_iotlb_sync,
> > +		.iotlb_sync_map		= viommu_iotlb_sync_map,
> >   		.free			= viommu_domain_free,
> >   	}
> >   };
> >
  
Jean-Philippe Brucker Sept. 19, 2023, 8:15 a.m. UTC | #2
On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
> > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > index 17dcd826f5c2..3649586f0e5c 100644
> > --- a/drivers/iommu/virtio-iommu.c
> > +++ b/drivers/iommu/virtio-iommu.c
> > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> >   	int ret;
> >   	unsigned long flags;
> > +	/*
> > +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > +	 * is initialized e.g. via iommu_create_device_direct_mappings()
> > +	 */
> > +	if (!viommu)
> > +		return 0;
> 
> Minor nit: I'd be inclined to make that check explicitly in the places where
> it definitely is expected, rather than allowing *any* sync to silently do
> nothing if called incorrectly. Plus then they could use
> vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
> (it did take me a moment to figure out how we could get to .iotlb_sync_map
> with a NULL viommu without viommu_map_pages() blowing up first...)

They're not strictly equivalent: this check works around a temporary issue
with the IOMMU core, which calls map/unmap before the domain is finalized.
Once we merge domain_alloc() and finalize(), then this check disappears,
but we still need to test nr_endpoints in map/unmap to handle detached
domains (and we still need to fix the synchronization of nr_endpoints
against attach/detach). That's why I preferred doing this on viommu and
keeping it in one place.

Thanks,
Jean
  
Robin Murphy Sept. 19, 2023, 8:28 a.m. UTC | #3
On 2023-09-19 09:15, Jean-Philippe Brucker wrote:
> On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
>>> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
>>> index 17dcd826f5c2..3649586f0e5c 100644
>>> --- a/drivers/iommu/virtio-iommu.c
>>> +++ b/drivers/iommu/virtio-iommu.c
>>> @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
>>>    	int ret;
>>>    	unsigned long flags;
>>> +	/*
>>> +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
>>> +	 * is initialized e.g. via iommu_create_device_direct_mappings()
>>> +	 */
>>> +	if (!viommu)
>>> +		return 0;
>>
>> Minor nit: I'd be inclined to make that check explicitly in the places where
>> it definitely is expected, rather than allowing *any* sync to silently do
>> nothing if called incorrectly. Plus then they could use
>> vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
>> (it did take me a moment to figure out how we could get to .iotlb_sync_map
>> with a NULL viommu without viommu_map_pages() blowing up first...)
> 
> They're not strictly equivalent: this check works around a temporary issue
> with the IOMMU core, which calls map/unmap before the domain is finalized.
> Once we merge domain_alloc() and finalize(), then this check disappears,
> but we still need to test nr_endpoints in map/unmap to handle detached
> domains (and we still need to fix the synchronization of nr_endpoints
> against attach/detach). That's why I preferred doing this on viommu and
> keeping it in one place.

Fair enough - it just seems to me that in both cases it's a detached 
domain, so its previous history of whether it's ever been otherwise or 
not shouldn't matter. Even once viommu is initialised, does it really 
make sense to send sync commands for a mapping on a detached domain 
where we haven't actually sent any map/unmap commands?

Thanks,
Robin.
  
Jason Gunthorpe Sept. 19, 2023, 2:46 p.m. UTC | #4
On Tue, Sep 19, 2023 at 09:15:19AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
> > > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > > index 17dcd826f5c2..3649586f0e5c 100644
> > > --- a/drivers/iommu/virtio-iommu.c
> > > +++ b/drivers/iommu/virtio-iommu.c
> > > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> > >   	int ret;
> > >   	unsigned long flags;
> > > +	/*
> > > +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > > +	 * is initialized e.g. via iommu_create_device_direct_mappings()
> > > +	 */
> > > +	if (!viommu)
> > > +		return 0;
> > 
> > Minor nit: I'd be inclined to make that check explicitly in the places where
> > it definitely is expected, rather than allowing *any* sync to silently do
> > nothing if called incorrectly. Plus then they could use
> > vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
> > (it did take me a moment to figure out how we could get to .iotlb_sync_map
> > with a NULL viommu without viommu_map_pages() blowing up first...)

This makes more sense to me

Ultimately this driver should reach a point where every iommu_domain
always has a non-null domain->viommu because it will be set during
alloc.

But it can still have nr_endpoints == 0, doesn't it make sense to
avoid sync in this case?

(btw this driver is missing locking around vdomain->nr_endpoints)

> They're not strictly equivalent: this check works around a temporary issue
> with the IOMMU core, which calls map/unmap before the domain is
> finalized.

Where? The above points to iommu_create_device_direct_mappings() but
it doesn't because the pgsize_bitmap == 0:

static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
					       struct device *dev)
{
	struct iommu_resv_region *entry;
	struct list_head mappings;
	unsigned long pg_size;
	int ret = 0;

	pg_size = domain->pgsize_bitmap ? 1UL << __ffs(domain->pgsize_bitmap) : 0;
	INIT_LIST_HEAD(&mappings);

	if (WARN_ON_ONCE(iommu_is_dma_domain(domain) && !pg_size))

Indeed, the driver should be failing all map's until the domain is
finalized because it has no way to check the IOVA matches the eventual
aperture.

Jason
  
Jean-Philippe Brucker Sept. 22, 2023, 7:52 a.m. UTC | #5
On Tue, Sep 19, 2023 at 09:28:08AM +0100, Robin Murphy wrote:
> On 2023-09-19 09:15, Jean-Philippe Brucker wrote:
> > On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
> > > > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > > > index 17dcd826f5c2..3649586f0e5c 100644
> > > > --- a/drivers/iommu/virtio-iommu.c
> > > > +++ b/drivers/iommu/virtio-iommu.c
> > > > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> > > >    	int ret;
> > > >    	unsigned long flags;
> > > > +	/*
> > > > +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > > > +	 * is initialized e.g. via iommu_create_device_direct_mappings()
> > > > +	 */
> > > > +	if (!viommu)
> > > > +		return 0;
> > > 
> > > Minor nit: I'd be inclined to make that check explicitly in the places where
> > > it definitely is expected, rather than allowing *any* sync to silently do
> > > nothing if called incorrectly. Plus then they could use
> > > vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
> > > (it did take me a moment to figure out how we could get to .iotlb_sync_map
> > > with a NULL viommu without viommu_map_pages() blowing up first...)
> > 
> > They're not strictly equivalent: this check works around a temporary issue
> > with the IOMMU core, which calls map/unmap before the domain is finalized.
> > Once we merge domain_alloc() and finalize(), then this check disappears,
> > but we still need to test nr_endpoints in map/unmap to handle detached
> > domains (and we still need to fix the synchronization of nr_endpoints
> > against attach/detach). That's why I preferred doing this on viommu and
> > keeping it in one place.
> 
> Fair enough - it just seems to me that in both cases it's a detached domain,
> so its previous history of whether it's ever been otherwise or not shouldn't
> matter. Even once viommu is initialised, does it really make sense to send
> sync commands for a mapping on a detached domain where we haven't actually
> sent any map/unmap commands?

If no requests were added by map/unmap, then viommu_sync_req() is
essentially a nop because virtio doesn't use sync commands (and
virtqueue_kick() only kicks the host when the queue's not empty, if I
remember correctly). It still does a bit of work so is less efficient than
a preliminary check on nr_endpoints, but it feels nicer to streamline the
case where the domain is attached, since map/unmap on detached domains
happens rarely, if ever.

Either is fine by me. An extra test won't make much difference performance
wise, and I guess will look less confusing. Niklas, do you mind resending
the version with nr_endpoints check (and updated commit messages)?

Thanks,
Jean
  
Jean-Philippe Brucker Sept. 22, 2023, 7:57 a.m. UTC | #6
On Tue, Sep 19, 2023 at 11:46:49AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 19, 2023 at 09:15:19AM +0100, Jean-Philippe Brucker wrote:
> > On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:
> > > > diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> > > > index 17dcd826f5c2..3649586f0e5c 100644
> > > > --- a/drivers/iommu/virtio-iommu.c
> > > > +++ b/drivers/iommu/virtio-iommu.c
> > > > @@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
> > > >   	int ret;
> > > >   	unsigned long flags;
> > > > +	/*
> > > > +	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
> > > > +	 * is initialized e.g. via iommu_create_device_direct_mappings()
> > > > +	 */
> > > > +	if (!viommu)
> > > > +		return 0;
> > > 
> > > Minor nit: I'd be inclined to make that check explicitly in the places where
> > > it definitely is expected, rather than allowing *any* sync to silently do
> > > nothing if called incorrectly. Plus then they could use
> > > vdomain->nr_endpoints for consistency with the equivalent checks elsewhere
> > > (it did take me a moment to figure out how we could get to .iotlb_sync_map
> > > with a NULL viommu without viommu_map_pages() blowing up first...)
> 
> This makes more sense to me
> 
> Ultimately this driver should reach a point where every iommu_domain
> always has a non-null domain->viommu because it will be set during
> alloc.
> 
> But it can still have nr_endpoints == 0, doesn't it make sense to
> avoid sync in this case?
> 
> (btw this driver is missing locking around vdomain->nr_endpoints)

Yes, that's on my list

> 
> > They're not strictly equivalent: this check works around a temporary issue
> > with the IOMMU core, which calls map/unmap before the domain is
> > finalized.
> 
> Where? The above points to iommu_create_device_direct_mappings() but
> it doesn't because the pgsize_bitmap == 0:

__iommu_domain_alloc() sets pgsize_bitmap in this case:

        /*
         * If not already set, assume all sizes by default; the driver
         * may override this later
         */
        if (!domain->pgsize_bitmap)
                domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

Thanks,
Jean
  
Jason Gunthorpe Sept. 22, 2023, 12:41 p.m. UTC | #7
On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
> > > They're not strictly equivalent: this check works around a temporary issue
> > > with the IOMMU core, which calls map/unmap before the domain is
> > > finalized.
> > 
> > Where? The above points to iommu_create_device_direct_mappings() but
> > it doesn't because the pgsize_bitmap == 0:
> 
> __iommu_domain_alloc() sets pgsize_bitmap in this case:
> 
>         /*
>          * If not already set, assume all sizes by default; the driver
>          * may override this later
>          */
>         if (!domain->pgsize_bitmap)
>                 domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

Dirver's shouldn't do that.

The core code was fixed to try again with mapping reserved regions to
support these kinds of drivers.

Jason
  
Robin Murphy Sept. 22, 2023, 1:13 p.m. UTC | #8
On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
>>>> They're not strictly equivalent: this check works around a temporary issue
>>>> with the IOMMU core, which calls map/unmap before the domain is
>>>> finalized.
>>>
>>> Where? The above points to iommu_create_device_direct_mappings() but
>>> it doesn't because the pgsize_bitmap == 0:
>>
>> __iommu_domain_alloc() sets pgsize_bitmap in this case:
>>
>>          /*
>>           * If not already set, assume all sizes by default; the driver
>>           * may override this later
>>           */
>>          if (!domain->pgsize_bitmap)
>>                  domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;
> 
> Dirver's shouldn't do that.
> 
> The core code was fixed to try again with mapping reserved regions to
> support these kinds of drivers.

This is still the "normal" code path, really; I think it's only AMD that 
started initialising the domain bitmap "early" and warranted making it 
conditional. However we *do* ultimately want all the drivers to do the 
same, so we can get rid of ops->pgsize_bitmap, because it's already 
pretty redundant and meaningless in the face of per-domain pagetable 
formats.

Thanks,
Robin.
  
Jason Gunthorpe Sept. 22, 2023, 4:27 p.m. UTC | #9
On Fri, Sep 22, 2023 at 02:13:18PM +0100, Robin Murphy wrote:
> On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
> > > > > They're not strictly equivalent: this check works around a temporary issue
> > > > > with the IOMMU core, which calls map/unmap before the domain is
> > > > > finalized.
> > > > 
> > > > Where? The above points to iommu_create_device_direct_mappings() but
> > > > it doesn't because the pgsize_bitmap == 0:
> > > 
> > > __iommu_domain_alloc() sets pgsize_bitmap in this case:
> > > 
> > >          /*
> > >           * If not already set, assume all sizes by default; the driver
> > >           * may override this later
> > >           */
> > >          if (!domain->pgsize_bitmap)
> > >                  domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;
> > 
> > Dirver's shouldn't do that.
> > 
> > The core code was fixed to try again with mapping reserved regions to
> > support these kinds of drivers.
> 
> This is still the "normal" code path, really; I think it's only AMD that
> started initialising the domain bitmap "early" and warranted making it
> conditional.

My main point was that iommu_create_device_direct_mappings() should
fail for unfinalized domains, setting pgsize_bitmap to allow it to
succeed is not a nice hack, and not necessary now.

What do you think about something like this to replace
iommu_create_device_direct_mappings(), that does enforce things
properly?


static int resv_cmp(void *priv, const struct list_head *llhs,
		    const struct list_head *lrhs)
{
	struct iommu_resv_region *lhs = list_entry(llhs, struct iommu_resv_region, list);
	struct iommu_resv_region *rhs = list_entry(lrhs, struct iommu_resv_region, list);

	if (lhs->start == rhs->start)
		return 0;
	if (lhs->start < rhs->start)
		return -1;
	return 1;
}

static int iommu_create_device_direct_mappings(struct iommu_domain *domain,
					       struct device *dev)
{
	struct iommu_resv_region *entry;
	struct iommu_resv_region *tmp;
	struct list_head mappings;
	struct list_head direct;
	phys_addr_t cur = 0;
	int ret = 0;

	INIT_LIST_HEAD(&mappings);
	INIT_LIST_HEAD(&direct);

	iommu_get_resv_regions(dev, &mappings);

	list_for_each_entry_safe(entry, tmp, &mappings, list) {
		if (entry->type == IOMMU_RESV_DIRECT)
			dev->iommu->require_direct = 1;

		if ((domain->type & __IOMMU_DOMAIN_PAGING) &&
		    (entry->type == IOMMU_RESV_DIRECT ||
		     entry->type == IOMMU_RESV_DIRECT_RELAXABLE)) {
			if (domain->geometry.aperture_start > entry->start ||
			    domain->geometry.aperture_end == 0 ||
			    (domain->geometry.aperture_end - 1) <
				    (entry->start + entry->length - 1)) {
				ret = -EINVAL;
				goto out;
			}
			list_move(&entry->list, &direct);
		}
	}

	if (list_empty(&direct))
		goto out;

	/*
	 * FW can have overlapping ranges, sort the list by start address
	 * and map any duplicated IOVA only once.
	 */
	list_sort(NULL, &direct, resv_cmp);
	list_for_each_entry(entry, &direct, list) {
		phys_addr_t start_pfn = entry->start / PAGE_SIZE;
		phys_addr_t last_pfn =
			(entry->length - 1 + entry->start) / PAGE_SIZE;

		if (start_pfn < cur)
			start_pfn = cur;

		if (start_pfn <= last_pfn) {
			ret = iommu_map(domain, start_pfn * PAGE_SIZE,
					start_pfn * PAGE_SIZE,
					(last_pfn - start_pfn + 1) * PAGE_SIZE,
					entry->prot, GFP_KERNEL);
			if (ret)
				goto out;
			cur = last_pfn + 1;
		}
	}

out:
	list_splice(&direct, &mappings);
	iommu_put_resv_regions(dev, &mappings);

	return ret;
}
  
Robin Murphy Sept. 22, 2023, 6:07 p.m. UTC | #10
On 22/09/2023 5:27 pm, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 02:13:18PM +0100, Robin Murphy wrote:
>> On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
>>> On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:
>>>>>> They're not strictly equivalent: this check works around a temporary issue
>>>>>> with the IOMMU core, which calls map/unmap before the domain is
>>>>>> finalized.
>>>>>
>>>>> Where? The above points to iommu_create_device_direct_mappings() but
>>>>> it doesn't because the pgsize_bitmap == 0:
>>>>
>>>> __iommu_domain_alloc() sets pgsize_bitmap in this case:
>>>>
>>>>           /*
>>>>            * If not already set, assume all sizes by default; the driver
>>>>            * may override this later
>>>>            */
>>>>           if (!domain->pgsize_bitmap)
>>>>                   domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;
>>>
>>> Dirver's shouldn't do that.
>>>
>>> The core code was fixed to try again with mapping reserved regions to
>>> support these kinds of drivers.
>>
>> This is still the "normal" code path, really; I think it's only AMD that
>> started initialising the domain bitmap "early" and warranted making it
>> conditional.
> 
> My main point was that iommu_create_device_direct_mappings() should
> fail for unfinalized domains, setting pgsize_bitmap to allow it to
> succeed is not a nice hack, and not necessary now.

Sure, but it's the whole "unfinalised domains" and rewriting 
domain->pgsize_bitmap after attach thing that is itself the massive 
hack. AMD doesn't do that, and doesn't need to; it knows the appropriate 
format at allocation time and can quite happily return a fully working 
domain which allows map before attach, but the old ops->pgsize_bitmap 
mechanism fundamentally doesn't work for multiple formats with different 
page sizes. The only thing I'd accuse it of doing wrong is the weird 
half-and-half thing of having one format as a default via one mechanism, 
and the other as an override through the other, rather than setting both 
explicitly.

virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings 
either; it sets it once it's discovered any instance, since apparently 
it's assuming that all instances must support identical page sizes, and 
thus once it's seen one it can work "normally" per the core code's 
assumptions. It's also I think the only driver which has a "finalise" 
bodge but *can* still properly support map-before-attach, by virtue of 
having to replay mappings to every new endpoint anyway.

> What do you think about something like this to replace
> iommu_create_device_direct_mappings(), that does enforce things
> properly?

I fail to see how that would make any practical difference. Either the 
mappings can be correctly set up in a pagetable *before* the relevant 
device is attached to that pagetable, or they can't (if the driver 
doesn't have enough information to be able to do so) and we just have to 
really hope nothing blows up in the race window between attaching the 
device to an empty pagetable and having a second try at 
iommu_create_device_direct_mappings(). That's a driver-level issue and 
has nothing to do with pgsize_bitmap either way.

Thanks,
Robin.
  
Jason Gunthorpe Sept. 22, 2023, 11:33 p.m. UTC | #11
On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:

> virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> either; it sets it once it's discovered any instance, since apparently it's
> assuming that all instances must support identical page sizes, and thus once
> it's seen one it can work "normally" per the core code's assumptions. It's
> also I think the only driver which has a "finalise" bodge but *can* still
> properly support map-before-attach, by virtue of having to replay mappings
> to every new endpoint anyway.

Well it can't quite do that since it doesn't know the geometry - it
all is sort of guessing and hoping it doesn't explode on replay. If it
knows the geometry it wouldn't need finalize...

> > What do you think about something like this to replace
> > iommu_create_device_direct_mappings(), that does enforce things
> > properly?
> 
> I fail to see how that would make any practical difference. Either the
> mappings can be correctly set up in a pagetable *before* the relevant device
> is attached to that pagetable, or they can't (if the driver doesn't have
> enough information to be able to do so) and we just have to really hope
> nothing blows up in the race window between attaching the device to an empty
> pagetable and having a second try at iommu_create_device_direct_mappings().
> That's a driver-level issue and has nothing to do with pgsize_bitmap either
> way.

Except we don't detect this in the core code correctly, that is my
point. We should detect the aperture conflict, not pgsize_bitmap to
check if it is the first or second try.

Jason
  
Baolu Lu Sept. 25, 2023, 2:48 a.m. UTC | #12
On 9/23/23 7:33 AM, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> 
>> virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
>> either; it sets it once it's discovered any instance, since apparently it's
>> assuming that all instances must support identical page sizes, and thus once
>> it's seen one it can work "normally" per the core code's assumptions. It's
>> also I think the only driver which has a "finalise" bodge but*can*  still
>> properly support map-before-attach, by virtue of having to replay mappings
>> to every new endpoint anyway.
> Well it can't quite do that since it doesn't know the geometry - it
> all is sort of guessing and hoping it doesn't explode on replay. If it
> knows the geometry it wouldn't need finalize...

The ultimate solution to this problem seems to be to add device pointer
to the parameter of ops->domain_alloc. So once the domain is allocated,
it is fully initialized. Attaching this domain to a device that is not
compatible will return -EINVAL, then the caller has to allocate a new
domain for this device.

I feel that this is not an AMD specific problem, other iommu drivers
will also encounter the similar problem sooner or later.

Best regards,
baolu
  
Jason Gunthorpe Sept. 25, 2023, 12:40 p.m. UTC | #13
On Mon, Sep 25, 2023 at 10:48:21AM +0800, Baolu Lu wrote:
> On 9/23/23 7:33 AM, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently it's
> > > assuming that all instances must support identical page sizes, and thus once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > > also I think the only driver which has a "finalise" bodge but*can*  still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> The ultimate solution to this problem seems to be to add device pointer
> to the parameter of ops->domain_alloc. So once the domain is allocated,
> it is fully initialized. Attaching this domain to a device that is not
> compatible will return -EINVAL, then the caller has to allocate a new
> domain for this device.

Sure, it looks like this, and it works already for default domain
setup.

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 8599394c9157d7..1697f5a2370785 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -637,15 +637,10 @@ static void viommu_event_handler(struct virtqueue *vq)
 
 /* IOMMU API */
 
-static struct iommu_domain *viommu_domain_alloc(unsigned type)
+static struct viommu_domain *__viommu_domain_alloc(void)
 {
 	struct viommu_domain *vdomain;
 
-	if (type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_DMA &&
-	    type != IOMMU_DOMAIN_IDENTITY)
-		return NULL;
-
 	vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
 	if (!vdomain)
 		return NULL;
@@ -654,16 +649,32 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type)
 	spin_lock_init(&vdomain->mappings_lock);
 	vdomain->mappings = RB_ROOT_CACHED;
 
+	return vdomain;
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+	struct viommu_domain *vdomain;
+
+	/*
+	 * viommu sometimes creates an identity domain out of a
+	 * paging domain, that can't use the global static.
+	 * During attach it will fill in an identity page table.
+	 */
+	if (type != IOMMU_DOMAIN_IDENTITY)
+		return NULL;
+	vdomain = __viommu_domain_alloc();
+	if (!vdomain)
+		return NULL;
 	return &vdomain->domain;
 }
 
 static int viommu_domain_finalise(struct viommu_endpoint *vdev,
-				  struct iommu_domain *domain)
+				  struct viommu_domain *vdomain)
 {
 	int ret;
 	unsigned long viommu_page_size;
 	struct viommu_dev *viommu = vdev->viommu;
-	struct viommu_domain *vdomain = to_viommu_domain(domain);
 
 	viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
 	if (viommu_page_size > PAGE_SIZE) {
@@ -680,13 +691,13 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
 
 	vdomain->id		= (unsigned int)ret;
 
-	domain->pgsize_bitmap	= viommu->pgsize_bitmap;
-	domain->geometry	= viommu->geometry;
+	vdomain->domain.pgsize_bitmap = viommu->pgsize_bitmap;
+	vdomain->domain.geometry = viommu->geometry;
 
 	vdomain->map_flags	= viommu->map_flags;
 	vdomain->viommu		= viommu;
 
-	if (domain->type == IOMMU_DOMAIN_IDENTITY) {
+	if (vdomain->domain.type == IOMMU_DOMAIN_IDENTITY) {
 		if (virtio_has_feature(viommu->vdev,
 				       VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
 			vdomain->bypass = true;
@@ -717,6 +728,24 @@ static void viommu_domain_free(struct iommu_domain *domain)
 	kfree(vdomain);
 }
 
+static struct iommu_domain *viommu_domain_alloc_paging(struct device *dev)
+{
+	struct viommu_domain *vdomain;
+	vdomain = __viommu_domain_alloc();
+	if (!vdomain)
+		return NULL;
+
+	if (dev) {
+		struct viommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+		if (viommu_domain_finalise(vdev, vdomain)) {
+			viommu_domain_free(&vdomain->domain);
+			return NULL;
+		}
+	}
+	return &vdomain->domain;
+}
+
 static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int i;
@@ -732,7 +761,7 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		 * Properly initialize the domain now that we know which viommu
 		 * owns it.
 		 */
-		ret = viommu_domain_finalise(vdev, domain);
+		ret = viommu_domain_finalise(vdev, vdomain);
 	} else if (vdomain->viommu != vdev->viommu) {
 		ret = -EINVAL;
 	}
@@ -1045,6 +1074,7 @@ static bool viommu_capable(struct device *dev, enum iommu_cap cap)
 static struct iommu_ops viommu_ops = {
 	.capable		= viommu_capable,
 	.domain_alloc		= viommu_domain_alloc,
+	.domain_alloc_paging	= viommu_domain_alloc_paging,
 	.probe_device		= viommu_probe_device,
 	.probe_finalize		= viommu_probe_finalize,
 	.release_device		= viommu_release_device,
  
Robin Murphy Sept. 25, 2023, 1:07 p.m. UTC | #14
On 2023-09-23 00:33, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> 
>> virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
>> either; it sets it once it's discovered any instance, since apparently it's
>> assuming that all instances must support identical page sizes, and thus once
>> it's seen one it can work "normally" per the core code's assumptions. It's
>> also I think the only driver which has a "finalise" bodge but *can* still
>> properly support map-before-attach, by virtue of having to replay mappings
>> to every new endpoint anyway.
> 
> Well it can't quite do that since it doesn't know the geometry - it
> all is sort of guessing and hoping it doesn't explode on replay. If it
> knows the geometry it wouldn't need finalize...

I think it's entirely reasonable to assume that any direct mappings 
specified for a device are valid for that device and its IOMMU. However, 
in the particular case of virtio, it really shouldn't ever have direct 
mappings anyway, since even if the underlying hardware did have any, the 
host can enforce the actual direct-mapping aspect itself, and just 
present them as unusable regions to the guest.

>>> What do you think about something like this to replace
>>> iommu_create_device_direct_mappings(), that does enforce things
>>> properly?
>>
>> I fail to see how that would make any practical difference. Either the
>> mappings can be correctly set up in a pagetable *before* the relevant device
>> is attached to that pagetable, or they can't (if the driver doesn't have
>> enough information to be able to do so) and we just have to really hope
>> nothing blows up in the race window between attaching the device to an empty
>> pagetable and having a second try at iommu_create_device_direct_mappings().
>> That's a driver-level issue and has nothing to do with pgsize_bitmap either
>> way.
> 
> Except we don't detect this in the core code correctly, that is my
> point. We should detect the aperture conflict, not pgsize_bitmap to
> check if it is the first or second try.

Again, that's irrelevant. It can only be about whether the actual 
->map_pages call succeeds or not. A driver could well know up-front that 
all instances support the same pgsize_bitmap and aperture, and set both 
at ->domain_alloc time, yet still be unable to handle an actual mapping 
without knowing which instance(s) that needs to interact with (e.g. 
omap-iommu).

Thanks,
Robin.
  
Jason Gunthorpe Sept. 25, 2023, 1:29 p.m. UTC | #15
On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:
> On 2023-09-23 00:33, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently it's
> > > assuming that all instances must support identical page sizes, and thus once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > > also I think the only driver which has a "finalise" bodge but *can* still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > 
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> I think it's entirely reasonable to assume that any direct mappings
> specified for a device are valid for that device and its IOMMU. However, in
> the particular case of virtio, it really shouldn't ever have direct mappings
> anyway, since even if the underlying hardware did have any, the host can
> enforce the actual direct-mapping aspect itself, and just present them as
> unusable regions to the guest.

I assume this machinery is for the ARM GIC ITS page....

> Again, that's irrelevant. It can only be about whether the actual
> ->map_pages call succeeds or not. A driver could well know up-front that all
> instances support the same pgsize_bitmap and aperture, and set both at
> ->domain_alloc time, yet still be unable to handle an actual mapping without
> knowing which instance(s) that needs to interact with (e.g. omap-iommu).

I think this is a different issue. The domain is supposed to represent
the actual io pte storage, and the storage is supposed to exist even
when the domain is not attached to anything.

As we said with tegra-gart, it is a bug in the driver if all the
mappings disappear when the last device is detached from the domain.
Driver bugs like this turn into significant issues with vfio/iommufd
as this will result in warn_on's and memory leaking.

So, I disagree that this is something we should be allowing in the API
design. map_pages should succeed (memory allocation failures aside) if
a IOVA within the aperture and valid flags are presented. Regardless
of the attachment status. Calling map_pages with an IOVA outside the
aperture should be a caller bug.

It looks omap is just mis-designed to store the pgd in the omap_iommu,
not the omap_iommu_domain :( pgd is clearly a per-domain object in our
API. And why does every instance need its own copy of the identical
pgd?

Jason
  
Robin Murphy Sept. 25, 2023, 5:23 p.m. UTC | #16
On 2023-09-25 14:29, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:
>> On 2023-09-23 00:33, Jason Gunthorpe wrote:
>>> On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
>>>
>>>> virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
>>>> either; it sets it once it's discovered any instance, since apparently it's
>>>> assuming that all instances must support identical page sizes, and thus once
>>>> it's seen one it can work "normally" per the core code's assumptions. It's
>>>> also I think the only driver which has a "finalise" bodge but *can* still
>>>> properly support map-before-attach, by virtue of having to replay mappings
>>>> to every new endpoint anyway.
>>>
>>> Well it can't quite do that since it doesn't know the geometry - it
>>> all is sort of guessing and hoping it doesn't explode on replay. If it
>>> knows the geometry it wouldn't need finalize...
>>
>> I think it's entirely reasonable to assume that any direct mappings
>> specified for a device are valid for that device and its IOMMU. However, in
>> the particular case of virtio, it really shouldn't ever have direct mappings
>> anyway, since even if the underlying hardware did have any, the host can
>> enforce the actual direct-mapping aspect itself, and just present them as
>> unusable regions to the guest.
> 
> I assume this machinery is for the ARM GIC ITS page....
> 
>> Again, that's irrelevant. It can only be about whether the actual
>> ->map_pages call succeeds or not. A driver could well know up-front that all
>> instances support the same pgsize_bitmap and aperture, and set both at
>> ->domain_alloc time, yet still be unable to handle an actual mapping without
>> knowing which instance(s) that needs to interact with (e.g. omap-iommu).
> 
> I think this is a different issue. The domain is supposed to represent
> the actual io pte storage, and the storage is supposed to exist even
> when the domain is not attached to anything.
> 
> As we said with tegra-gart, it is a bug in the driver if all the
> mappings disappear when the last device is detached from the domain.
> Driver bugs like this turn into significant issues with vfio/iommufd
> as this will result in warn_on's and memory leaking.
> 
> So, I disagree that this is something we should be allowing in the API
> design. map_pages should succeed (memory allocation failures aside) if
> a IOVA within the aperture and valid flags are presented. Regardless
> of the attachment status. Calling map_pages with an IOVA outside the
> aperture should be a caller bug.
> 
> It looks omap is just mis-designed to store the pgd in the omap_iommu,
> not the omap_iommu_domain :( pgd is clearly a per-domain object in our
> API. And why does every instance need its own copy of the identical
> pgd?

The point wasn't that it was necessarily a good and justifiable example, 
just that it is one that exists, to demonstrate that in general we have 
no reasonable heuristic for guessing whether ->map_pages is going to 
succeed or not other than by calling it and seeing if it succeeds or 
not. And IMO it's a complete waste of time thinking about ways to make 
such a heuristic possible instead of just getting on with fixing 
iommu_domain_alloc() to make the problem disappear altogether. Once 
Joerg pushes out the current queue I'll rebase and resend v4 of the bus 
ops removal, then hopefully get back to despairing at the hideous pile 
of WIP iommu_domain_alloc() patches I currently have on top of it...

Thanks,
Robin.
  

Patch

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 17dcd826f5c2..3649586f0e5c 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -189,6 +189,12 @@  static int viommu_sync_req(struct viommu_dev *viommu)
 	int ret;
 	unsigned long flags;
 
+	/*
+	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
+	 * is initialized e.g. via iommu_create_device_direct_mappings()
+	 */
+	if (!viommu)
+		return 0;
 	spin_lock_irqsave(&viommu->request_lock, flags);
 	ret = __viommu_sync_req(viommu);
 	if (ret)
@@ -843,7 +849,7 @@  static int viommu_map_pages(struct iommu_domain *domain, unsigned long iova,
 			.flags		= cpu_to_le32(flags),
 		};
 
-		ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map));
+		ret = viommu_add_req(vdomain->viommu, &map, sizeof(map));
 		if (ret) {
 			viommu_del_mappings(vdomain, iova, end);
 			return ret;
@@ -912,6 +918,14 @@  static void viommu_iotlb_sync(struct iommu_domain *domain,
 	viommu_sync_req(vdomain->viommu);
 }
 
+static int viommu_iotlb_sync_map(struct iommu_domain *domain,
+				 unsigned long iova, size_t size)
+{
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	return viommu_sync_req(vdomain->viommu);
+}
+
 static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
 {
 	struct iommu_resv_region *entry, *new_entry, *msi = NULL;
@@ -1058,6 +1072,7 @@  static struct iommu_ops viommu_ops = {
 		.unmap_pages		= viommu_unmap_pages,
 		.iova_to_phys		= viommu_iova_to_phys,
 		.iotlb_sync		= viommu_iotlb_sync,
+		.iotlb_sync_map		= viommu_iotlb_sync_map,
 		.free			= viommu_domain_free,
 	}
 };