iommu: Avoid races around device probe

Message ID 1946ef9f774851732eed78760a78ec40dbc6d178.1667591503.git.robin.murphy@arm.com
State New
Series iommu: Avoid races around device probe

Commit Message

Robin Murphy Nov. 4, 2022, 7:51 p.m. UTC
We currently have 3 different ways that __iommu_probe_device() may be
called, but no real guarantee that multiple callers can't tread on each
other, especially once asynchronous driver probe gets involved. It would
likely have taken a fair bit of luck to hit this previously, but commit
57365a04c921 ("iommu: Move bus setup to IOMMU device registration") ups
the odds since now it's not just omap-iommu that may trigger multiple
bus_iommu_probe() calls in parallel if probing asynchronously.

Add a lock to ensure we can't try to double-probe a device, and also
close some possible race windows to make sure we're truly robust against
trying to double-initialise a group via two different member devices.

Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iommu.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)
  

Comments

Brian Norris Nov. 5, 2022, 1:36 a.m. UTC | #1
On Fri, Nov 04, 2022 at 07:51:43PM +0000, Robin Murphy wrote:
> We currently have 3 different ways that __iommu_probe_device() may be
> called, but no real guarantee that multiple callers can't tread on each
> other, especially once asynchronous driver probe gets involved. It would
> likely have taken a fair bit of luck to hit this previously, but commit
> 57365a04c921 ("iommu: Move bus setup to IOMMU device registration") ups
> the odds since now it's not just omap-iommu that may trigger multiple
> bus_iommu_probe() calls in parallel if probing asynchronously.
> 
> Add a lock to ensure we can't try to double-probe a device, and also
> close some possible race windows to make sure we're truly robust against
> trying to double-initialise a group via two different member devices.
> 
> Reported-by: Brian Norris <briannorris@chromium.org>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/iommu/iommu.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)

If I've tested appropriately (there's always room for operator error),
this seems to resolve the problems I reported:

Tested-by: Brian Norris <briannorris@chromium.org>

I haven't reviewed closely enough to know how precisely this is a
regression (your description sounds like you think the bug existed some
time before that), but based on testing, this sounds like:

Fixes: 57365a04c921 ("iommu: Move bus setup to IOMMU device registration")

But even if not, the report could probably use:

Link: https://lore.kernel.org/lkml/Y1CHh2oM5wyHs06J@google.com/

And most of all, thanks!

Brian
  
Robin Murphy Nov. 7, 2022, 4:41 p.m. UTC | #2
On 2022-11-05 01:36, Brian Norris wrote:
> On Fri, Nov 04, 2022 at 07:51:43PM +0000, Robin Murphy wrote:
>> We currently have 3 different ways that __iommu_probe_device() may be
>> called, but no real guarantee that multiple callers can't tread on each
>> other, especially once asynchronous driver probe gets involved. It would
>> likely have taken a fair bit of luck to hit this previously, but commit
>> 57365a04c921 ("iommu: Move bus setup to IOMMU device registration") ups
>> the odds since now it's not just omap-iommu that may trigger multiple
>> bus_iommu_probe() calls in parallel if probing asynchronously.
>>
>> Add a lock to ensure we can't try to double-probe a device, and also
>> close some possible race windows to make sure we're truly robust against
>> trying to double-initialise a group via two different member devices.
>>
>> Reported-by: Brian Norris <briannorris@chromium.org>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>   drivers/iommu/iommu.c | 28 ++++++++++++++++++++++------
>>   1 file changed, 22 insertions(+), 6 deletions(-)
> 
> If I've tested appropriately (there's always room for operator error),
> this seems to resolve the problems I reported:
> 
> Tested-by: Brian Norris <briannorris@chromium.org>
> 
> I haven't reviewed closely enough to know how precisely this is a
> regression (your description sounds like you think the bug existed some
> time before that), but based on testing, this sounds like:
> 
> Fixes: 57365a04c921 ("iommu: Move bus setup to IOMMU device registration")

That commit did not introduce the race, just made it more visible. The 
underlying condition probably goes back at least 3 years to where we 
started allocating and freeing per-device data around what was then the 
ops->add_device() call.

In practice, you'd have to be absurdly lucky for an iommu_probe_device() 
call via {of,acpi}_dma_configure() to line up with bus_iommu_probe() 
touching the same device, but by inspection I think it's theoretically 
possible. Thus previously there was probably only a realistic chance of 
seeing it on certain OMAP systems, where the explicit bus_iommu_probe() 
calls could overlap if both instances probed in parallel - my commit 
just brings all the other drivers in line with that same behaviour via 
iommu_device_register(). Other systems - like Rockchip in particular - 
may have greater numbers of IOMMU instances and thus even more chance 
for parallel probes to line up just right.

Since nobody's ever reported real-world issues on OMAP (although it's 
quite likely nobody's ever tried driver_async_probe with omap-iommu 
anyway) there doesn't seem to be a compelling reason for backporting, so 
I didn't fancy spending hours digging through subsystem-wide history 
trying to figure out an appropriate Fixes tag; as long as this can make 
6.1 that should be enough :)

Thanks,
Robin.

> But even if not, the report could probably use:
> 
> Link: https://lore.kernel.org/lkml/Y1CHh2oM5wyHs06J@google.com/
> 
> And most of all, thanks!
> 
> Brian
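
The double-probe race discussed in this reply can be illustrated with a small userspace sketch. This is not kernel code: it assumes POSIX threads in place of the kernel mutex API, and struct fake_device, fake_probe_device() and run_parallel_probes() are hypothetical names invented for the illustration. It only shows the general idea of a function-local static lock that makes "check for per-device data, then allocate it" a single atomic step, with one shared unlock path on exit.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NTHREADS 8

/* Hypothetical stand-in for struct device; not a kernel type. */
struct fake_device {
	void *iommu_data;	/* loosely analogous to per-device IOMMU data */
	int init_count;		/* how many times initialisation actually ran */
};

/*
 * Initialise the device at most once, however many callers race here.
 * The lock is a function-local static, so every caller shares it.
 */
static int fake_probe_device(struct fake_device *dev)
{
	static pthread_mutex_t probe_lock = PTHREAD_MUTEX_INITIALIZER;
	int ret = 0;

	pthread_mutex_lock(&probe_lock);

	/* Another caller already set this device up: nothing to do. */
	if (dev->iommu_data)
		goto out_unlock;

	dev->iommu_data = malloc(16);
	if (!dev->iommu_data) {
		ret = -ENOMEM;
		goto out_unlock;
	}
	dev->init_count++;

out_unlock:
	/* Single exit path, so the lock is released on success and failure. */
	pthread_mutex_unlock(&probe_lock);
	return ret;
}

static void *probe_thread(void *arg)
{
	fake_probe_device(arg);
	return NULL;
}

/* Probe one device from NTHREADS threads at once; return the init count. */
int run_parallel_probes(void)
{
	struct fake_device dev = { NULL, 0 };
	pthread_t tids[NTHREADS];
	int i, count;

	for (i = 0; i < NTHREADS; i++) {
		int err = pthread_create(&tids[i], NULL, probe_thread, &dev);

		assert(err == 0);
		(void)err;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);

	count = dev.init_count;
	free(dev.iommu_data);
	return count;
}
```

Built with something like `cc -pthread`, run_parallel_probes() should report exactly one initialisation. Without the lock, two threads could both see a NULL iommu_data pointer and both allocate, which is the kind of window the patch aims to close.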
  
Joerg Roedel Nov. 19, 2022, 9:18 a.m. UTC | #3
On Fri, Nov 04, 2022 at 07:51:43PM +0000, Robin Murphy wrote:
> We currently have 3 different ways that __iommu_probe_device() may be
> called, but no real guarantee that multiple callers can't tread on each
> other, especially once asynchronous driver probe gets involved. It would
> likely have taken a fair bit of luck to hit this previously, but commit
> 57365a04c921 ("iommu: Move bus setup to IOMMU device registration") ups
> the odds since now it's not just omap-iommu that may trigger multiple
> bus_iommu_probe() calls in parallel if probing asynchronously.
> 
> Add a lock to ensure we can't try to double-probe a device, and also
> close some possible race windows to make sure we're truly robust against
> trying to double-initialise a group via two different member devices.
> 
> Reported-by: Brian Norris <briannorris@chromium.org>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/iommu/iommu.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)

Applied, thanks Robin.
  

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 65a3b3d886dc..959d895fc1df 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -283,13 +283,23 @@  static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
 	struct iommu_device *iommu_dev;
 	struct iommu_group *group;
+	static DEFINE_MUTEX(iommu_probe_device_lock);
 	int ret;
 
 	if (!ops)
 		return -ENODEV;
-
-	if (!dev_iommu_get(dev))
-		return -ENOMEM;
+	/*
+	 * Serialise to avoid races between IOMMU drivers registering in
+	 * parallel and/or the "replay" calls from ACPI/OF code via client
+	 * driver probe. Once the latter have been cleaned up we should
+	 * probably be able to use device_lock() here to minimise the scope,
+	 * but for now enforcing a simple global ordering is fine.
+	 */
+	mutex_lock(&iommu_probe_device_lock);
+	if (!dev_iommu_get(dev)) {
+		ret = -ENOMEM;
+		goto err_unlock;
+	}
 
 	if (!try_module_get(ops->owner)) {
 		ret = -EINVAL;
@@ -309,11 +319,14 @@  static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 		ret = PTR_ERR(group);
 		goto out_release;
 	}
-	iommu_group_put(group);
 
+	mutex_lock(&group->mutex);
 	if (group_list && !group->default_domain && list_empty(&group->entry))
 		list_add_tail(&group->entry, group_list);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
 
+	mutex_unlock(&iommu_probe_device_lock);
 	iommu_device_link(iommu_dev, dev);
 
 	return 0;
@@ -328,6 +341,9 @@  static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 err_free:
 	dev_iommu_free(dev);
 
+err_unlock:
+	mutex_unlock(&iommu_probe_device_lock);
+
 	return ret;
 }
 
@@ -1799,11 +1815,11 @@  int bus_iommu_probe(struct bus_type *bus)
 		return ret;
 
 	list_for_each_entry_safe(group, next, &group_list, entry) {
+		mutex_lock(&group->mutex);
+
 		/* Remove item from the list */
 		list_del_init(&group->entry);
 
-		mutex_lock(&group->mutex);
-
 		/* Try to allocate default domain */
 		probe_alloc_default_domain(bus, group);
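
The group-list part of the change can be sketched in userspace as well. Again this is only an illustration under stated assumptions: the list helpers are a minimal reimplementation modelled on the kernel's list_head rather than the kernel headers, pthread mutexes stand in for the kernel mutex API, and struct fake_group, fake_group_queue(), fake_group_setup() and demo_queue_once() are hypothetical names. The point it tries to show is that the "not yet on a list and no default domain" check, the list insertion, and the later removal all happen under the same per-group mutex, so two member devices cannot queue the same group twice.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_append(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_remove_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical stand-in for struct iommu_group; not a kernel type. */
struct fake_group {
	pthread_mutex_t mutex;
	struct list_node entry;
	bool has_default_domain;
};

void fake_group_init(struct fake_group *g)
{
	pthread_mutex_init(&g->mutex, NULL);
	list_init(&g->entry);
	g->has_default_domain = false;
}

/* Producer: check and insert in one critical section. True if queued. */
bool fake_group_queue(struct fake_group *g, struct list_node *pending)
{
	bool queued = false;

	pthread_mutex_lock(&g->mutex);
	if (!g->has_default_domain && list_is_empty(&g->entry)) {
		list_append(&g->entry, pending);
		queued = true;
	}
	pthread_mutex_unlock(&g->mutex);
	return queued;
}

/* Consumer: take the lock before unlinking, then finish the setup. */
void fake_group_setup(struct fake_group *g)
{
	pthread_mutex_lock(&g->mutex);
	list_remove_init(&g->entry);
	g->has_default_domain = true;
	pthread_mutex_unlock(&g->mutex);
}

/* Returns how many of the three queue attempts actually queued the group. */
int demo_queue_once(void)
{
	struct list_node pending;
	struct fake_group g;
	int queued = 0;

	list_init(&pending);
	fake_group_init(&g);

	queued += fake_group_queue(&g, &pending); /* first member device */
	queued += fake_group_queue(&g, &pending); /* already on the list */
	fake_group_setup(&g);
	queued += fake_group_queue(&g, &pending); /* domain already exists */

	assert(list_is_empty(&pending));
	pthread_mutex_destroy(&g.mutex);
	return queued;
}
```

In this sketch only the first attempt queues the group. If the removal in fake_group_setup() happened before taking the mutex, a concurrent fake_group_queue() could observe an empty entry with no default domain yet and queue the group a second time, which is roughly the window that moving mutex_lock() above list_del_init() is meant to close.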