[v2,1/3] iommufd: Add devices_users to track the hw_pagetable usage by device

Message ID c1c65ce093a3b585546dd17a77949efbe03a81d9.1674939002.git.nicolinc@nvidia.com
State New
Series iommufd: Remove iommufd_hw_pagetable_has_group

Commit Message

Nicolin Chen Jan. 28, 2023, 9:18 p.m. UTC
  From: Yi Liu <yi.l.liu@intel.com>

Currently, a hw_pagetable tracks the attached devices using a device list.
When attaching the first device to a kernel-managed hw_pagetable, the hwpt
should be linked to the IOAS. When detaching the last device from this hwpt,
the link with the IOAS should be removed too. This first-or-last device
check is done with list_empty(hwpt->devices).

However, with a nested configuration, when a device is attached to the
user-managed stage-1 hw_pagetable, it is added to this user-managed hwpt's
device list instead of the kernel-managed stage-2 hwpt's one. This breaks
the logic for linking a kernel-managed hw_pagetable to, and disconnecting
it from, the IOAS/IOPT: e.g. the stage-2 hw_pagetable would be linked to
the IOAS multiple times if multiple devices are attached, yet its device
list becomes empty as soon as one device is detached.

Add a devices_users in struct iommufd_hw_pagetable to track the users of
the hw_pagetable by the attached devices. Make this field a pointer and
only allocate it for a stage-2 hw_pagetable. A stage-1 hw_pagetable should
reuse the stage-2 hw_pagetable's devices_users, because when a device
attaches to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS
is still required. So, with a nested configuration, increase the
devices_users on the stage-2 (parent) hwpt, no matter whether a device is
attached to the stage-1 or the stage-2 hwpt.

Adding this devices_users also reduces the dependency on the device list,
so it helps the following patch to remove the device list completely.
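
In code terms, the first-or-last device check changes from list_empty()
to the shared refcount (condensed from the diff below):

attach:
	if (refcount_read(hwpt->devices_users) == 1)	/* first device */
		rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
	...
	refcount_inc(hwpt->devices_users);

detach:
	refcount_dec(hwpt->devices_users);
	...
	if (refcount_read(hwpt->devices_users) == 1)	/* last device gone */
		iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);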

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          |  8 +++++---
 drivers/iommu/iommufd/hw_pagetable.c    | 11 +++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 3 files changed, 17 insertions(+), 3 deletions(-)
  

Comments

Tian, Kevin Jan. 29, 2023, 9:23 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, January 29, 2023 5:18 AM
> 
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Currently, hw_pagetable tracks the attached devices using a device list.
> When attaching the first device to the kernel-managed hw_pagetable, it
> should be linked to IOAS. When detaching the last device from this hwpt,
> the link with IOAS should be removed too. And this first-or-last device
> check is done with list_empty(hwpt->devices).
> 
> However, with a nested configuration, when a device is attached to the
> user-managed stage-1 hw_pagetable, it will be added to this user-managed
> hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> multiple times if multiple device is attached, but it will become empty
> as soon as one device detached.
> 
> Add a devices_users in struct iommufd_hw_pagetable to track the users of

device_users

> hw_pagetable by the attached devices. Make this field as a pointer, only
> allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse
> the stage-2 hw_pagetable's devices_users, because when a device attaches
> to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still
> required. So, with a nested configuration, increase the devices_users on
> the stage-2 (parent) hwpt, no matter a device is attached to the stage-1
> or the stage-2 hwpt.

Above is very confusing w/o seeing the full series of nesting support.

As a preparatory step this should focus on existing code and what this
series tries to achieve. e.g. I'd not make device_users a pointer here.
Do that incrementally when the nesting support comes.
  
Nicolin Chen Jan. 29, 2023, 9:30 a.m. UTC | #2
On Sun, Jan 29, 2023 at 09:23:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Sunday, January 29, 2023 5:18 AM
> >
> > From: Yi Liu <yi.l.liu@intel.com>
> >
> > Currently, hw_pagetable tracks the attached devices using a device list.
> > When attaching the first device to the kernel-managed hw_pagetable, it
> > should be linked to IOAS. When detaching the last device from this hwpt,
> > the link with IOAS should be removed too. And this first-or-last device
> > check is done with list_empty(hwpt->devices).
> >
> > However, with a nested configuration, when a device is attached to the
> > user-managed stage-1 hw_pagetable, it will be added to this user-managed
> > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> > this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> > multiple times if multiple device is attached, but it will become empty
> > as soon as one device detached.
> >
> > Add a devices_users in struct iommufd_hw_pagetable to track the users of
> 
> device_users

I assume you are suggesting a rename, right? I can do that.

> > hw_pagetable by the attached devices. Make this field as a pointer, only
> > allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse
> > the stage-2 hw_pagetable's devices_users, because when a device attaches
> > to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still
> > required. So, with a nested configuration, increase the devices_users on
> > the stage-2 (parent) hwpt, no matter a device is attached to the stage-1
> > or the stage-2 hwpt.
> 
> Above is very confusing w/o seeing the full series of nesting support.
> 
> As a preparatory step this should focus on existing code and what this
> series tries to achieve. e.g. I'd not make device_users a pointer here.
> Do that incrementally when the nesting support comes.

OK. I will shift that part to the nesting series.

Thanks
Nic
  
Tian, Kevin Jan. 29, 2023, 9:39 a.m. UTC | #3
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, January 29, 2023 5:31 PM
> 
> > > Add a devices_users in struct iommufd_hw_pagetable to track the users of
> >
> > device_users
> 
> I assume you are suggesting a rename right? I can do that.
> 

yes
  
Yi Liu Jan. 30, 2023, 2:22 a.m. UTC | #4
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Sunday, January 29, 2023 5:23 PM
>
> > hw_pagetable by the attached devices. Make this field as a pointer, only
> > allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse
> > the stage-2 hw_pagetable's devices_users, because when a device attaches
> > to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still
> > required. So, with a nested configuration, increase the devices_users on
> > the stage-2 (parent) hwpt, no matter a device is attached to the stage-1
> > or the stage-2 hwpt.
> 
> Above is very confusing w/o seeing the full series of nesting support.
> 
> As a preparatory step this should focus on existing code and what this
> series tries to achieve. e.g. I'd not make device_users a pointer here.
> Do that incrementally when the nesting support comes.

Yes, in the below branch, I've moved this patch to be together with the nesting
commits. Maybe I can send out the nesting RFC.

https://github.com/yiliu1765/iommufd/commits/wip/iommufd-v6.2-rc4-nesting

Regards,
Yi Liu
  
Jason Gunthorpe Jan. 30, 2023, 3:02 p.m. UTC | #5
On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Currently, hw_pagetable tracks the attached devices using a device list.
> When attaching the first device to the kernel-managed hw_pagetable, it
> should be linked to IOAS. When detaching the last device from this hwpt,
> the link with IOAS should be removed too. And this first-or-last device
> check is done with list_empty(hwpt->devices).
> 
> However, with a nested configuration, when a device is attached to the
> user-managed stage-1 hw_pagetable, it will be added to this user-managed
> hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> multiple times if multiple device is attached, but it will become empty
> as soon as one device detached.

Why? This seems really weird to say.

The stage 2 is linked explicitly to the IOAS that drives its
map/unmap

Why is there any implicit activity here? There should be no implicit
attach of the S2 to an IOAS ever.

Jason
  
Nicolin Chen Jan. 30, 2023, 7:27 p.m. UTC | #6
On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote:
> On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote:
> > From: Yi Liu <yi.l.liu@intel.com>
> > 
> > Currently, hw_pagetable tracks the attached devices using a device list.
> > When attaching the first device to the kernel-managed hw_pagetable, it
> > should be linked to IOAS. When detaching the last device from this hwpt,
> > the link with IOAS should be removed too. And this first-or-last device
> > check is done with list_empty(hwpt->devices).
> > 
> > However, with a nested configuration, when a device is attached to the
> > user-managed stage-1 hw_pagetable, it will be added to this user-managed
> > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> > this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> > multiple times if multiple device is attached, but it will become empty
> > as soon as one device detached.
> 
> Why this seems really weird to say.
> 
> The stage 2 is linked explicitly to the IOAS that drives it's
> map/unmap
> 
> Why is there any implicit activity here? There should be no implicit
> attach of the S2 to an IOAS ever.

I think this is supposed to describe the following use case:

Two stage-1 hwpts share the same parent s2_hwpt:

attach device1 to stage-1 hwpt1:
	...
	if (list_empty(s1_hwpt1->devices))		// empty; true
		iopt_table_add_domain(s2_hwpt->domain); // do once
	s1_hwpt1 device list cnt++;
	...

attach device2 to stage-1 hwpt2:
	...
	if (list_empty(s1_hwpt2->devices))		// empty; true
		iopt_table_add_domain(s2_hwpt->domain); // do again
	s1_hwpt2 device list cnt++;
	...

This is because each hwpt has its own device list. To prevent
the duplicated iopt_table_add_domain call, we would need to check
all the device lists. So this patch adds a shared counter
(devices_users) among all the relevant hwpts.
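
With that shared counter, the same flow no longer duplicates the
call, since both s1 hwpts point at the parent s2_hwpt's devices_users:

attach device1 to stage-1 hwpt1:
	...
	if (refcount_read(s2_hwpt->devices_users) == 1)	// true
		iopt_table_add_domain(s2_hwpt->domain);	// do once
	refcount_inc(s2_hwpt->devices_users);
	...

attach device2 to stage-1 hwpt2:
	...
	if (refcount_read(s2_hwpt->devices_users) == 1)	// false now
		iopt_table_add_domain(s2_hwpt->domain);	// skipped
	refcount_inc(s2_hwpt->devices_users);
	...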

I can revise the commit message to make better sense.

Thanks
Nicolin
  
Jason Gunthorpe Jan. 30, 2023, 7:50 p.m. UTC | #7
On Mon, Jan 30, 2023 at 11:27:37AM -0800, Nicolin Chen wrote:
> On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote:
> > On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote:
> > > From: Yi Liu <yi.l.liu@intel.com>
> > > 
> > > Currently, hw_pagetable tracks the attached devices using a device list.
> > > When attaching the first device to the kernel-managed hw_pagetable, it
> > > should be linked to IOAS. When detaching the last device from this hwpt,
> > > the link with IOAS should be removed too. And this first-or-last device
> > > check is done with list_empty(hwpt->devices).
> > > 
> > > However, with a nested configuration, when a device is attached to the
> > > user-managed stage-1 hw_pagetable, it will be added to this user-managed
> > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> > > multiple times if multiple device is attached, but it will become empty
> > > as soon as one device detached.
> > 
> > Why this seems really weird to say.
> > 
> > The stage 2 is linked explicitly to the IOAS that drives it's
> > map/unmap
> > 
> > Why is there any implicit activity here? There should be no implicit
> > attach of the S2 to an IOAS ever.
> 
> I think this is supposed to say the following use case:
> 
> Two stage-1 hwpts share the same parent s2_hwpt:
> 
> attach device1 to stage-1 hwpt1:
> 	...
> 	if (list_empty(s1_hwpt1->devices))		// empty; true
> 		iopt_table_add_domain(s2_hwpt->domain); // do once
> 	s1_hwpt1 device list cnt++;
> 	...

No, this doesn't make sense.

The s2_hwpt should be created explicitly, not using autodomains

When it is created it should be linked to a single IOAS and that is
when iopt_table_add_domain() should have been called.

The S1 attach should do *nothing* to a S2.

Jason
  
Nicolin Chen Jan. 30, 2023, 8:04 p.m. UTC | #8
On Mon, Jan 30, 2023 at 03:50:07PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 30, 2023 at 11:27:37AM -0800, Nicolin Chen wrote:
> > On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote:
> > > On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote:
> > > > From: Yi Liu <yi.l.liu@intel.com>
> > > > 
> > > > Currently, hw_pagetable tracks the attached devices using a device list.
> > > > When attaching the first device to the kernel-managed hw_pagetable, it
> > > > should be linked to IOAS. When detaching the last device from this hwpt,
> > > > the link with IOAS should be removed too. And this first-or-last device
> > > > check is done with list_empty(hwpt->devices).
> > > > 
> > > > However, with a nested configuration, when a device is attached to the
> > > > user-managed stage-1 hw_pagetable, it will be added to this user-managed
> > > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> > > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> > > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> > > > multiple times if multiple device is attached, but it will become empty
> > > > as soon as one device detached.
> > > 
> > > Why this seems really weird to say.
> > > 
> > > The stage 2 is linked explicitly to the IOAS that drives it's
> > > map/unmap
> > > 
> > > Why is there any implicit activity here? There should be no implicit
> > > attach of the S2 to an IOAS ever.
> > 
> > I think this is supposed to say the following use case:
> > 
> > Two stage-1 hwpts share the same parent s2_hwpt:
> > 
> > attach device1 to stage-1 hwpt1:
> > 	...
> > 	if (list_empty(s1_hwpt1->devices))		// empty; true
> > 		iopt_table_add_domain(s2_hwpt->domain); // do once
> > 	s1_hwpt1 device list cnt++;
> > 	...
> 
> No, this doesn't make sense.
> 
> The s2_hwpt should be created explicitly, not using autodomains

iopt_table_add_domain() is called in iommufd_device_do_attach(),
so it is shared by both a user-created hwpt and an autodomain.

> When it is created it should be linked to a single IOAS and that is
> when iopt_table_add_domain() should have been called.

I recall we've discussed this: SMMU sets up the domain when it
attaches the device, so we made a compromise here...

> The S1 attach should do *nothing* to a S2.

With that compromise, a nested attach flow may be
1) create an s2 hwpt
2) create an s1 hwpt
3) attach dev to s1
    calls iopt_table_add_domain()

Thanks
Nicolin
  
Jason Gunthorpe Jan. 30, 2023, 8:35 p.m. UTC | #9
On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote:

> I recall we've discussed this that SMMU sets up domain when it
> attaches the device to, so we made a compromise here...

The ARM driver has a problem that it doesn't know what SMMU instance
will host the domain when it is allocated so it doesn't know if it
should select a S1 or S2 page table format - which is determined by
the capabilities of the specific SMMU HW block.

However, we don't have this problem when creating the S2. The S2
should be created by a special alloc_domain_iommufd() asking for the
S2 format. Not only does the new alloc_domain_iommufd API directly
request a S2 format table, but it also specifies the struct device so
any residual details can be determined directly.

Thus there is no need to do the two stage process when working with
the S2.

So fixup the driver to create fully configured iommu_domains
immediately and get rid of this problem.

IMHO I would structure the smmu driver so that all the different
iommu_domain formats have their own ops pointer. The special
"undecided" format would have a special ops with only attach_dev and
at first attach it would switch the ops to whatever format it
selected.

I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
complexity all over the place. You know what type it is because you
were called on a op that is only called on its type.

Jason
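
For illustration, a minimal sketch of this ops-switching idea
(arm_smmu_undecided_ops and dev_to_smmu() are hypothetical names, and
the arm_smmu_domain_finalise() signature is simplified here):

/* Each format gets its own ops; the "undecided" ops implements only
 * attach_dev, which finalises the domain and then switches domain->ops
 * to whatever format was selected.
 */
static const struct iommu_domain_ops arm_smmu_s1_ops;
static const struct iommu_domain_ops arm_smmu_s2_ops;

static int arm_smmu_undecided_attach_dev(struct iommu_domain *domain,
					 struct device *dev)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	int ret;

	/* dev_to_smmu() stands in for looking up the SMMU instance */
	ret = arm_smmu_domain_finalise(domain, dev_to_smmu(dev));
	if (ret)
		return ret;

	/* From here on, only the format-specific ops are ever called */
	domain->ops = smmu_domain->stage == ARM_SMMU_DOMAIN_S2 ?
		      &arm_smmu_s2_ops : &arm_smmu_s1_ops;
	return domain->ops->attach_dev(domain, dev);
}

static const struct iommu_domain_ops arm_smmu_undecided_ops = {
	.attach_dev = arm_smmu_undecided_attach_dev,
};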
  
Nicolin Chen Jan. 30, 2023, 8:53 p.m. UTC | #10
On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote:
> 
> > I recall we've discussed this that SMMU sets up domain when it
> > attaches the device to, so we made a compromise here...
> 
> The ARM driver has a problem that it doesn't know what SMMU instance
> will host the domain when it is allocated so it doesn't know if it
> should select a S1 or S2 page table format - which is determined by
> the capabilities of the specific SMMU HW block.
> 
> However, we don't have this problem when creating the S2. The S2
> should be created by a special alloc_domain_iommufd() asking for the
> S2 format. Not only does the new alloc_domain_iommufd API directly
> request a S2 format table, but it also specifies the struct device so
> any residual details can be determined directly.
> 
> Thus there is no need to do the two stage process when working with
> the S2.

Ah, right! Taking a quick look, we should be able to call that
arm_smmu_domain_finalise when handling alloc_domain_iommufd().

> So fixup the driver to create fully configured iommu_domain's
> immediately and get rid of this problem.

OK. I will draft a patch today.

> IMHO I would structure the smmu driver so that all the different
> iommu_domain formats have their own ops pointer. The special
> "undecided" format would have a special ops with only attach_dev and
> at first attach it would switch the ops to whatever format it
> selected.
> 
> I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
> complexity all over the place. You know what type it is because you
> were called on a op that is only called on its type.

I see. I can try that as well. Hopefully it won't touch too
many places that can raise a big concern/objection..

Thanks
Nicolin
  
Nicolin Chen Feb. 1, 2023, 6:57 a.m. UTC | #11
On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
 
> IMHO I would structure the smmu driver so that all the different
> iommu_domain formats have their own ops pointer. The special
> "undecided" format would have a special ops with only attach_dev and
> at first attach it would switch the ops to whatever format it
> selected.
> 
> I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
> complexity all over the place. You know what type it is because you
> were called on a op that is only called on its type.

An auto/unmanaged domain allocation via iommu_domain_alloc() would
be S1, while an allocation via ops->domain_alloc_user can be S1 or
S2 with a given parameter/flag. So, actually the format is always
decided. The trouble we have with the iommu_domain_alloc() path is
that we don't pass the dev pointer down to ops->domain_alloc. So,
the SMMU driver can't know which SMMU device the device is behind,
resulting in being unable to finalize the domain. Robin mentioned
that he has a patch "iommu: Pass device through ops->domain_alloc".
Perhaps that is required for us to entirely fix the add_domain()
problem?

Thanks
Nic
  
Nicolin Chen Feb. 1, 2023, 7:48 a.m. UTC | #12
On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote:
> On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote:
> > 
> > > I recall we've discussed this that SMMU sets up domain when it
> > > attaches the device to, so we made a compromise here...
> > 
> > The ARM driver has a problem that it doesn't know what SMMU instance
> > will host the domain when it is allocated so it doesn't know if it
> > should select a S1 or S2 page table format - which is determined by
> > the capabilities of the specific SMMU HW block.
> > 
> > However, we don't have this problem when creating the S2. The S2
> > should be created by a special alloc_domain_iommufd() asking for the
> > S2 format. Not only does the new alloc_domain_iommufd API directly
> > request a S2 format table, but it also specifies the struct device so
> > any residual details can be determined directly.
> > 
> > Thus there is no need to do the two stage process when working with
> > the S2.
> 
> Ah, right! Taking a quick look, we should be able to call that
> arm_smmu_domain_finalise when handling alloc_domain_iommufd().
> 
> > So fixup the driver to create fully configured iommu_domain's
> > immediately and get rid of this problem.
> 
> OK. I will draft a patch today.

@Yi
Do you recall doing iopt_table_add_domain() in hwpt_alloc()?

Jason has a great point above. So even SMMU should be able to
call the iopt_table_add_domain() after a kernel-managed hwpt
allocation rather than after an iommu_attach_group(), except
for an auto_domain or a selftest mock_domain, which still needs
to attach the device first; otherwise the SMMU driver (currently)
cannot finalize the domain aperture.

I made a draft today, and ran some sanity tests with SMMUv3:
  "iommufd: Attach/detach hwpt to IOAS at alloc/destroy"
  https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639

The change basically moves iopt_table_add/remove_domain() next
to hwpt_alloc/destroy(); an auto_domain or a mock_domain needs
to attach_dev first, before calling the add_domain().
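
Roughly, the reordering looks like this (a condensed sketch of the
linked commit, not the exact code):

__iommufd_hw_pagetable_alloc():
	...
	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);	// moved here
	if (rc)
		goto out_abort;
	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
	...

iommufd_hw_pagetable_destroy():
	...
	list_del(&hwpt->hwpt_item);
	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain); // moved here
	iommu_domain_free(hwpt->domain);
	...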

With this change, the following patches for attach_ioas() and the
new selftest should also be updated. I have them in a wip branch:
https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc4-nesting-01312023

Can you check if there's anything wrong with this approach?
And would it be possible for you to integrate this into the
nesting series?

I can also help clean up and polish the changes with a proper
commit message.

Thanks
Nic
  
Nicolin Chen Feb. 1, 2023, 7:56 a.m. UTC | #13
On Tue, Jan 31, 2023 at 10:57:19PM -0800, Nicolin Chen wrote:
> On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
>  
> > IMHO I would structure the smmu driver so that all the different
> > iommu_domain formats have their own ops pointer. The special
> > "undecided" format would have a special ops with only attach_dev and
> > at first attach it would switch the ops to whatever format it
> > selected.
> > 
> > I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
> > complexity all over the place. You know what type it is because you
> > were called on a op that is only called on its type.
> 
> An auto/unmanaged domain allocation via iommu_domain_alloc() would
> be S1, while an allocation via ops->domain_alloc_user can be S1 or
> S2 with a given parameter/flag. So, actually the format is always
> decided. The trouble we have with the iommu_domain_alloc() path is
> that we don't pass the dev pointer down to ops->domain_alloc. So,

Precisely, the undecided part is not the format but the ias/oas
and SMMU feature bits of the SMMU HW device, needed to set up
the domain->geometry. And the regular domain_alloc(), taking a
"type" argument alone, can't get the SMMU pointer.

Thanks
Nic

> the SMMU driver can't know which SMMU device the device is behind,
> resulting in being unable to finalizing the domain. Robin mentioned
> that he has a patch "iommu: Pass device through ops->domain_alloc".
> Perhaps that is required for us to entirely fix the add_domain()
> problem?
> 
> Thanks
> Nic
  
Jason Gunthorpe Feb. 1, 2023, 3:53 p.m. UTC | #14
On Tue, Jan 31, 2023 at 10:57:13PM -0800, Nicolin Chen wrote:
> On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
>  
> > IMHO I would structure the smmu driver so that all the different
> > iommu_domain formats have their own ops pointer. The special
> > "undecided" format would have a special ops with only attach_dev and
> > at first attach it would switch the ops to whatever format it
> > selected.
> > 
> > I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
> > complexity all over the place. You know what type it is because you
> > were called on a op that is only called on its type.
> 
> An auto/unmanaged domain allocation via iommu_domain_alloc() would
> be S1, while an allocation via ops->domain_alloc_user can be S1 or
> S2 with a given parameter/flag. So, actually the format is always
> decided. 

No, it can't decide the S1/S2 format until it knows the smmu because
of this:

	/* Restrict the stage to what we can actually support */
	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
		smmu_domain->stage = ARM_SMMU_DOMAIN_S1;

So the format is never decided.

> that we don't pass the dev pointer down to ops->domain_alloc. So,
> the SMMU driver can't know which SMMU device the device is behind,
> resulting in being unable to finalizing the domain. Robin mentioned
> that he has a patch "iommu: Pass device through ops->domain_alloc".
> Perhaps that is required for us to entirely fix the add_domain()
> problem?

Robin is making progress, hopefully soon

So the issue is: with replace you need to have the domain populated
before we can call replace, but you can't populate the domain until it
is bound, because of the above issue? That seems unsolvable without
fixing up the driver.

I'd say replace can go ahead ignoring that issue and that for now
replace will only work on ARM with domains created by
domain_alloc_user that are fully configured.

It will start working correctly for auto domains once Robin's changes
get finished.

Is there another issue?

Jason
  
Nicolin Chen Feb. 1, 2023, 5:46 p.m. UTC | #15
On Wed, Feb 01, 2023 at 11:53:02AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 10:57:13PM -0800, Nicolin Chen wrote:
> > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
> >  
> > > IMHO I would structure the smmu driver so that all the different
> > > iommu_domain formats have their own ops pointer. The special
> > > "undecided" format would have a special ops with only attach_dev and
> > > at first attach it would switch the ops to whatever format it
> > > selected.
> > > 
> > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD'
> > > complexity all over the place. You know what type it is because you
> > > were called on a op that is only called on its type.
> > 
> > An auto/unmanaged domain allocation via iommu_domain_alloc() would
> > be S1, while an allocation via ops->domain_alloc_user can be S1 or
> > S2 with a given parameter/flag. So, actually the format is always
> > decided. 
> 
> No, it can't decide the S1/S2 format until it knows the smmu because
> of this:
> 
> 	/* Restrict the stage to what we can actually support */
> 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
> 		smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
> 
> So the format is never decided.

OK. That's right. And the solution to that is also passing a dev
pointer in regular ->domain_alloc() op.

> > that we don't pass the dev pointer down to ops->domain_alloc. So,
> > the SMMU driver can't know which SMMU device the device is behind,
> > resulting in being unable to finalizing the domain. Robin mentioned
> > that he has a patch "iommu: Pass device through ops->domain_alloc".
> > Perhaps that is required for us to entirely fix the add_domain()
> > problem?
> 
> Robin is making progress, hopefully soon
> 
> So the issue is with replace you need to have the domain populated
> before we can call replace but you can't populate the domain until it
> is bound because of the above issue? That seems unsovlable without
> fixing up the driver.

Not really. A REPLACE ioctl is just an ATTACH, if the device just
gets BIND-ed. So the SMMU driver will initialize ("finalise") the
domain during the replace() call, then iopt_table_add_domain() can
be done.

So, not a blocker here.

> I'd say replace can go ahead ingoring that issue and that for now
> replace will only work on ARM with domains created by
> domain_alloc_user that are fully configured.
> 
> It will start working correctly for auto domains once Robin's changes
> get finished.
> 
> Is there another issue?

Oh. I think we mixed the topics here. These three patches were
not to unblock but to clean up a way for the replace series and
the nesting series, for the device locking issue:

	if (cur_hwpt != hwpt)
		mutex_lock(&cur_hwpt->device_lock);
	mutex_lock(&hwpt->device_lock);
	...
	if (iommufd_hw_pagetable_has_group()) {	// touching device list
		...
		iommu_group_replace_domain();
		...
	}
	if (cur_hwpt && hwpt)
		list_del(&idev->devices_item);
	list_add(&idev->devices_item, &cur_hwpt->devices);
	...
	mutex_unlock(&hwpt->device_lock);
	if (cur_hwpt != hwpt)
		mutex_unlock(&cur_hwpt->device_lock);

I just gave it another thought. Since we have the patch-2
from this series moving the ioas->mutex, it already serializes
the attach/detach routines. And I see that all the places touching
idev->devices_item and hwpt->devices are protected by ioas->mutex.
So, perhaps we can simply remove the device_lock?

do_attach():
	mutex_lock(&ioas->mutex); // protect both devices_item and hwpt_item
	...
	if (iommufd_hw_pagetable_has_group()) {	// touching device list
		...
		iommu_group_replace_domain();
		...
	}
	if (cur_hwpt && hwpt)
		list_del(&idev->devices_item);
	list_add(&idev->devices_item, &cur_hwpt->devices);
	...
	mutex_unlock(&ioas->mutex);

do_detach():
	mutex_lock(&ioas->mutex); // protect both devices_item and hwpt_item
	...
	if (iommufd_hw_pagetable_has_group()) {	// touching device list
		...
		iommu_detach_group();
		...
	}
	list_del(&idev->devices_item);
	...
	mutex_unlock(&ioas->mutex);

If this is correct, I think I can prepare the replace series and
send it by the end of the day.

Thanks
Nic
  
Jason Gunthorpe Feb. 1, 2023, 6:37 p.m. UTC | #16
On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote:
> > So the issue is with replace you need to have the domain populated
> > before we can call replace but you can't populate the domain until it
> > is bound because of the above issue? That seems unsovlable without
> > fixing up the driver.
> 
> Not really. A REPLACE ioctl is just an ATTACH, if the device just
> gets BIND-ed. So the SMMU driver will initialize ("finalise") the
> domain during the replace() call, then iopt_table_add_domain() can
> be done.
> 
> So, not a blocker here.

Well, yes, there sort of is because the whole flow becomes nonsensical
- we are supposed to have the iommu_domain populated by the time we do
replace. Otherwise replace is extra-pointless..

> > Is there another issue?
> 
> Oh. I think we mixed the topics here. These three patches were
> not to unblock but to clean up a way for the replace series and
> the nesting series, for the device locking issue:
> 
> 	if (cur_hwpt != hwpt)
> 		mutex_lock(&cur_hwpt->device_lock);
> 	mutex_lock(&hwpt->device_lock);
> 	...
> 	if (iommufd_hw_pagetabe_has_group()) {	// touching device list
> 		...
> 		iommu_group_replace_domain();
> 		...
> 	}
> 	if (cur_hwpt && hwpt)
> 		list_del(&idev->devices_item);
> 	list_add(&idev->devices_item, &cur_hwpt->devices);
> 	...
> 	mutex_unlock(&hwpt->device_lock);
> 	if (cur_hwpt != hwpt)
> 		mutex_unlock(&cur_hwpt->device_lock);

What is the issue? That isn't quite right, but the basic bit is fine

If you want to do replace then you have to hold both devices_lock and
you write that super ugly thing like this

lock_both:
   if (hwpt_a < hwpt_b) {
      mutex_lock(&hwpt_a->devices_lock);
      mutex_lock_nested(&hwpt_b->devices_lock);
   } else if (hwpt_a > hwpt_b) {
      mutex_lock(&hwpt_b->devices_lock);
      mutex_lock_nested(&hwpt_a->devices_lock);
   } else
      mutex_lock(&hwpt_a->devices_lock);

And then it is trivial, yes?

Using the group_lock in the iommu core is the right way to fix
this.. Maybe someday we can do that.

(also document that replace causes all the devices in the group to
change iommu_domains at once)

> I just gave another thought about it. Since we have the patch-2
> from this series moving the ioas->mutex, it already serializes
> attach/detach routines. And I see that all the places touching
> idev->device_item and hwpt->devices are protected by ioas->mutex.
> So, perhaps we can simply remove the device_lock?

The two hwpts are not required to have the same ioas, so this doesn't
really help..

Jason
  
Nicolin Chen Feb. 1, 2023, 7:25 p.m. UTC | #17
On Wed, Feb 01, 2023 at 02:37:42PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote:
> > > So the issue is with replace you need to have the domain populated
> > > before we can call replace but you can't populate the domain until it
> > > is bound because of the above issue? That seems unsovlable without
> > > fixing up the driver.
> > 
> > Not really. A REPLACE ioctl is just an ATTACH, if the device just
> > gets BIND-ed. So the SMMU driver will initialize ("finalise") the
> > domain during the replace() call, then iopt_table_add_domain() can
> > be done.
> > 
> > So, not a blocker here.
> 
> Well, yes, there sort of is because the whole flow becomes nonsensical
> - we are supposed to have the iommu_domain populated by the time we do
> replace. Otherwise replace is extra-pointless..

The "finalise" is one of the very first lines of the attach_dev()
callback function in SMMU driver, though it might still undesirably
fail the replace().

https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639
Btw, this is a draft that I made to move iopt_table_add_domain(). I
think we can have this with the nesting series.

Later, once we pass in the dev pointer to the ->domain_alloc op
using Robin's change, all the iopt_table_add_domain() calls can be done
within the hwpt_alloc(), prior to an attach()/replace().

> > > Is there another issue?
> > 
> > Oh. I think we mixed the topics here. These three patches were
> > not to unblock but to clean up a way for the replace series and
> > the nesting series, for the device locking issue:
> > 
> > 	if (cur_hwpt != hwpt)
> > 		mutex_lock(&cur_hwpt->device_lock);
> > 	mutex_lock(&hwpt->device_lock);
> > 	...
> > 	if (iommufd_hw_pagetabe_has_group()) {	// touching device list
> > 		...
> > 		iommu_group_replace_domain();
> > 		...
> > 	}
> > 	if (cur_hwpt && hwpt)
> > 		list_del(&idev->devices_item);
> > 	list_add(&idev->devices_item, &cur_hwpt->devices);
> > 	...
> > 	mutex_unlock(&hwpt->device_lock);
> > 	if (cur_hwpt != hwpt)
> > 		mutex_unlock(&cur_hwpt->device_lock);
> 
> What is the issue? That isn't quite right, but the basic bit is fine
> 
> If you want to do replace then you have to hold both devices_lock and
> you write that super ugly thing like this
> 
> lock_both:
>    if (hwpt_a < hwpt_b) {
>       mutex_lock(&hwpt_a->devices_lock);
>       mutex_lock_nested(&hwpt_b->devices_lock);
>    } else if (hwpt_a > hwpt_b) {
>       mutex_lock(&hwpt_b->devices_lock);
>       mutex_lock_nested(&hwpt_a->devices_lock);
>    } else
>       mutex_lock(&hwpt_a->devices_lock);
> 
> And then it is trivial, yes?

Yea. That's your previous remark.

> Using the group_lock in the iommu core is the right way to fix
> this.. Maybe someday we can do that.
> 
> (also document that replace causes all the devices in the group to
> change iommu_domains at once)

Yes. There's a discussion in PATCH-3 of this series. I drafted a
patch changing iommu_attach/detach_dev():
https://github.com/nicolinc/iommufd/commit/124f7804ef38d50490b606fd56c1e27ce551a839

Baolu had a similar patch series a year ago. So we might continue
that effort in parallel, and eventually drop the device list/lock.

> > I just gave another thought about it. Since we have the patch-2
> > from this series moving the ioas->mutex, it already serializes
> > attach/detach routines. And I see that all the places touching
> > idev->device_item and hwpt->devices are protected by ioas->mutex.
> > So, perhaps we can simply remove the device_lock?
> 
> The two hwpts are not required to have the same ioas, so this doesn't
> really help..

Hmm...in that case, we should hold two ioas->mutex locks in
addition to two device locks?

Thanks
Nic
  
Jason Gunthorpe Feb. 1, 2023, 8 p.m. UTC | #18
On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote:

> The "finalise" is one of the very first lines of the attach_dev()
> callback function in SMMU driver, though it might still undesirably
> fail the replace().

It won't get that far.

Remember how this all works - only autodomains have the special path
that allocates a domain, attaches the empty domain, and then populates
it with the ioas. We made this special path specifically to accomodate
the current ARM drivers, otherwise they wouldn't work at all.

replace can't do this - replace must always start out with a
pre-existing hwpt that was allocated with a dedicated hwpt allocation
ioctl.

When the hwpt was allocated it must be linked to the IOAS at that
time, because we definitely don't do deferred IOAS linkage.

So on ARM when you create an unfinalized iommu_domain it cannot be
added to the IOAS (because it has an empty aperture) and creation will
fail, or worse, it will get added to an empty IOAS and make the IOAS
permanently unusable.

> Hmm...in that case, we should hold two ioas->mutex locks in
> addition to two device locks?

No, the device lock is the thing that protects the data you are
touching; no reason to make it any different.

Jason
  
Nicolin Chen Feb. 1, 2023, 9:18 p.m. UTC | #19
On Wed, Feb 01, 2023 at 04:00:40PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote:
> 
> > The "finalise" is one of the very first lines of the attach_dev()
> > callback function in SMMU driver, though it might still undesirably
> > fail the replace().
> 
> It won't get that far.
> 
> Remember how this all works - only autodomains have the special path
> that allocates a domain, attaches the empty domain, and then populates
> it with the ioas. We made this special path specifically to accomodate
> the current ARM drivers, otherwise they wouldn't work at all.

Yes.

> replace can't do this - replace must always start out with a
> pre-existing hwpt that was allocated with a dedicated hwpt allocation
> ioctl.
> 
> Wwhen the hwpt was allocated it must be linked to the IOAS at that
> time, because we definately don't do defered IOAS linkage.
>
> So on ARM when you create an unfinalizes iommu_domain it cannot be
> added to the IOAS (because it has an empty aperture) and creation will
> fail, or worse, it will get added to an empty IOAS and make the IOAS
> permanently unusable.

IIUIC, user space might add some IOVA mappings to the hwpt/iopt
before doing a replace(). If we did a deferred IOAS linkage to
this hwpt/iopt, it would cause a problem, because we would be adding
the reserved region for the MSI window later than the IOMMU_IOAS_MAP
calls. Thus, "we definitely don't do deferred IOAS linkage".
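
To illustrate the hazard with a deferred linkage (a hypothetical
sequence, not actual code):

	alloc ioas
	IOMMU_IOAS_MAP over IOVAs overlapping the MSI window	// allowed: no
								// reserved regions
								// in the iopt yet
	alloc hwpt with deferred IOAS linkage			// no add_domain()
	attach/replace() -> iopt_table_add_domain()		// too late: the MSI
								// window is already
								// mapped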

With this justification, I think I should include my patch of
moving iopt_table_add/remove_domain(), in the replace series.

> > Hmm...in that case, we should hold two ioas->mutex locks in
> > addition to two device locks?
> 
> No, the device lock is the thing that protects the data you are
> touching no reason to make it any different.

Thinking about it more: we don't really touch the old_hwpt->ioas or
its iopt, but only the new_hwpt->ioas/iopt to reserve the MSI
window, so there isn't an actual need to hold old_hwpt->ioas's
mutex.

I will prepare the replace series and send it by the end of the
day, upon certain level of sanity/verification.

Thanks!
Nic
  
Nicolin Chen Feb. 2, 2023, 7:28 a.m. UTC | #20
On Wed, Feb 01, 2023 at 01:18:08PM -0800, Nicolin Chen wrote:
> On Wed, Feb 01, 2023 at 04:00:40PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote:
> > 
> > > The "finalise" is one of the very first lines of the attach_dev()
> > > callback function in SMMU driver, though it might still undesirably
> > > fail the replace().
> > 
> > It won't get that far.
> > 
> > Remember how this all works - only autodomains have the special path
> > that allocates a domain, attaches the empty domain, and then populates
> > it with the ioas. We made this special path specifically to accomodate
> > the current ARM drivers, otherwise they wouldn't work at all.
> 
> Yes.
> 
> > replace can't do this - replace must always start out with a
> > pre-existing hwpt that was allocated with a dedicated hwpt allocation
> > ioctl.
> > 
> > Wwhen the hwpt was allocated it must be linked to the IOAS at that
> > time, because we definately don't do defered IOAS linkage.
> >
> > So on ARM when you create an unfinalizes iommu_domain it cannot be
> > added to the IOAS (because it has an empty aperture) and creation will
> > fail, or worse, it will get added to an empty IOAS and make the IOAS
> > permanently unusable.
> 
> IIUIC, user space might add some IOVA mappings to the hwpt/iopt,
> before doing a replace(). If we do a deferred IOAS linkage to
> this hwpt/iopt, it will cause a problem because we are adding
> the reserved region for the MSI window later than IOMMU_IOAS_MAP
> calls. Thus, "we definately don't do defered IOAS linkage".
> 
> With this justification, I think I should include my patch of
> moving iopt_table_add/remove_domain(), in the replace series.

I just posted the replace series. But I found that the baseline
(without the nesting series) always allocates iommu_domains
with the ->domain_alloc() op. So, we cannot actually
move the iopt_table_add_domain() in the replace series, as I
intended to.

Yet, the good news is that our nesting series replaces the
domain_alloc() op entirely with ->domain_alloc_user() for all
the iommu_domain allocations, including for auto_domains. So,
we can completely move iopt_table_add_domain() to the hwpt
allocation function. And we don't really need a big change
in the SMMU driver, nor Robin's patch that passes a dev ptr
to the domain_alloc() op. And even this device_users refcount in
this patch is no longer needed. It also simplifies the shared
this patch is no longer needed. It also simplifies the shared
device locking situation, if I am not missing anything.

So, in short, we'll have to wait for ->domain_alloc_user()
patch (in the nesting series) to unblock the problem that we
discussed above regarding the iopt_table_add_domain().

Thanks
Nic
  
Nicolin Chen Feb. 2, 2023, 9:12 a.m. UTC | #21
On Tue, Jan 31, 2023 at 11:48:04PM -0800, Nicolin Chen wrote:
> On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote:
> > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote:
> > > 
> > > > I recall we've discussed this that SMMU sets up domain when it
> > > > attaches the device to, so we made a compromise here...
> > > 
> > > The ARM driver has a problem that it doesn't know what SMMU instance
> > > will host the domain when it is allocated so it doesn't know if it
> > > should select a S1 or S2 page table format - which is determined by
> > > the capabilities of the specific SMMU HW block.
> > > 
> > > However, we don't have this problem when creating the S2. The S2
> > > should be created by a special alloc_domain_iommufd() asking for the
> > > S2 format. Not only does the new alloc_domain_iommufd API directly
> > > request a S2 format table, but it also specifies the struct device so
> > > any residual details can be determined directly.
> > > 
> > > Thus there is no need to do the two stage process when working with
> > > the S2.
> > 
> > Ah, right! Taking a quick look, we should be able to call that
> > arm_smmu_domain_finalise when handling alloc_domain_iommufd().
> > 
> > > So fixup the driver to create fully configured iommu_domain's
> > > immediately and get rid of this problem.
> > 
> > OK. I will draft a patch today.
> 
> @Yi
> Do you recall doing iopt_table_add_domain() in hwpt_alloc()?
> 
> Jason has a great point above. So even SMMU should be able to
> call the iopt_table_add_domain() after a kernel-manged hwpt
> allocation rather than after an iommu_attach_group(), except
> an auto_domain or a selftest mock_domain that still needs to
> attach the device first, otherwise the SMMU driver (currently)
> cannot finalize the domain aperture.

Some update today: I found ops->domain_alloc_user() is used
for all domain allocations inside IOMMUFD. So, without any
special case, we could entirely do iopt_table_add_domain()
in the __iommufd_hw_pagetable_alloc() and accordingly do
iopt_table_remove_domain() in the hw_pagetable_destroy():
https://github.com/nicolinc/iommufd/commit/85248e1c5f645e1eb701562eb078cf586af617fe
(We can also skip that "symmetric" patch for the list_add,
 moving the list_add/del calls directly to alloc/destroy.)

Without the complication of the add/remove_domain() calls
in the do_attach/detach() functions, there is no need for
the device_users counter any more.
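
For example, the attach path could then shrink to something like this
(a rough sketch, assuming add/remove_domain() are handled entirely at
hwpt alloc/destroy time):

do_attach():
	mutex_lock(&hwpt->devices_lock);
	...
	rc = iommu_attach_group(hwpt->domain, idev->group);	// no first-device
								// add_domain() here
	...
	list_add(&idev->devices_item, &hwpt->devices);
	mutex_unlock(&hwpt->devices_lock);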

I am not 100% sure if we still need the shared device lock,
though so far the sanity tests that I've run don't show a problem.
We may discuss it later when we converge our branches.
As before, I am also okay with doing incremental
changes on top of your tree and asking you to integrate them,
once you have your branch ready.

My full wip branch:
https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc5-nesting

Thanks
Nic
  
Jason Gunthorpe Feb. 2, 2023, 3:03 p.m. UTC | #22
On Wed, Feb 01, 2023 at 11:28:05PM -0800, Nicolin Chen wrote:

> So, in short, we'll have to wait for ->domain_alloc_user()
> patch (in the nesting series) to unblock the problem that we
> discussed above regarding the iopt_table_add_domain().

I think that is probably OK

It would be good to prepare an SMMUv3 patch, assuming Robin's stuff
lands, to move the domain iopt determination to allocation time so we
are ready.

Jason
  
Yi Liu Feb. 7, 2023, 4:19 a.m. UTC | #23
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 2, 2023 5:12 PM
> 
> On Tue, Jan 31, 2023 at 11:48:04PM -0800, Nicolin Chen wrote:
> > On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote:
> > > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote:
> > > > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote:
> > > >
> > > > > I recall we've discussed this that SMMU sets up domain when it
> > > > > attaches the device to, so we made a compromise here...
> > > >
> > > > The ARM driver has a problem that it doesn't know what SMMU instance
> > > > will host the domain when it is allocated so it doesn't know if it
> > > > should select a S1 or S2 page table format - which is determined by
> > > > the capabilities of the specific SMMU HW block.
> > > >
> > > > However, we don't have this problem when creating the S2. The S2
> > > > should be created by a special alloc_domain_iommufd() asking for the
> > > > S2 format. Not only does the new alloc_domain_iommufd API directly
> > > > request a S2 format table, but it also specifies the struct device so
> > > > any residual details can be determined directly.
> > > >
> > > > Thus there is no need to do the two stage process when working with
> > > > the S2.
> > >
> > > Ah, right! Taking a quick look, we should be able to call that
> > > arm_smmu_domain_finalise when handling alloc_domain_iommufd().
> > >
> > > > So fixup the driver to create fully configured iommu_domain's
> > > > immediately and get rid of this problem.
> > >
> > > OK. I will draft a patch today.
> >
> > @Yi
> > Do you recall doing iopt_table_add_domain() in hwpt_alloc()?

Yeah, doing iopt_table_add_domain() in hwpt_alloc suits well.
The only reason for the current code is that SMMU has a drawback with it.
Great to see it is solved.

> > Jason has a great point above. So even SMMU should be able to
> > call the iopt_table_add_domain() after a kernel-manged hwpt
> > allocation rather than after an iommu_attach_group(), except
> > an auto_domain or a selftest mock_domain that still needs to
> > attach the device first, otherwise the SMMU driver (currently)
> > cannot finalize the domain aperture.
> 
> Some update today: I found ops->domain_alloc_user() is used
> for all domain allocations inside IOMMUFD. So, without any
> special case, we could entirely do iopt_table_add_domain()
> in the __iommufd_hw_pagetable_alloc() and accordingly do
> iopt_table_remove_domain() in the hw_pagetable_destroy():
> https://github.com/nicolinc/iommufd/commit/85248e1c5f645e1eb701562eb078cf586af617fe
> (We can also skip that "symmetric" patch for the list_add,
>  moving the list_add/del calls directly to alloc/destroy.)
> 
> Without the complication of the add/remove_domain() calls
> in the do_attach/detach() functions, there is no need for
> the device_users counter any more.

Yes.

> I am not 100% sure if we still need the shared device lock,
> though so far the sanity that I run doesn't show a problem.
> We may discuss about it later when we converge our branches.
> As before, I am also okay to do in the way with incremental
> changes on top of your tree and to ask you to integrate,
> once you have your branch ready.

I think reusing the device lock can simplify things and avoid hurting
readability. However, I'll double-check.

> My full wip branch:
> https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc5-nesting

Thanks, I'm now re-integrating the nesting patches.

Regards,
Yi Liu
  
Yi Liu Feb. 7, 2023, 4:27 a.m. UTC | #24
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 2, 2023 3:25 AM
> 
> On Wed, Feb 01, 2023 at 02:37:42PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote:
> > > > So the issue is with replace you need to have the domain populated
> > > > before we can call replace but you can't populate the domain until it
> > > > is bound because of the above issue? That seems unsovlable without
> > > > fixing up the driver.
> > >
> > > Not really. A REPLACE ioctl is just an ATTACH, if the device just
> > > gets BIND-ed. So the SMMU driver will initialize ("finalise") the
> > > domain during the replace() call, then iopt_table_add_domain() can
> > > be done.
> > >
> > > So, not a blocker here.
> >
> > Well, yes, there sort of is because the whole flow becomes nonsensical
> > - we are supposed to have the iommu_domain populated by the time we do
> > replace. Otherwise replace is extra-pointless..
> 
> The "finalise" is one of the very first lines of the attach_dev()
> callback function in SMMU driver, though it might still undesirably
> fail the replace().
> 
> https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639
> Btw, this is a draft that I made to move iopt_table_add_domain(). I
> think we can have this with the nesting series.
> 
> Later, once we pass in the dev pointer to the ->domain_alloc op
> using Robin's change, all the iopt_table_add_domain() can be done
> within the hwpt_alloc(), prior to an attach()/replace().
> 
> > > > Is there another issue?
> > >
> > > Oh. I think we mixed the topics here. These three patches were
> > > not to unblock but to clean up a way for the replace series and
> > > the nesting series, for the device locking issue:
> > >
> > > 	if (cur_hwpt != hwpt)
> > > 		mutex_lock(&cur_hwpt->device_lock);
> > > 	mutex_lock(&hwpt->device_lock);
> > > 	...
> > > 	if (iommufd_hw_pagetable_has_group()) {	// touching device list
> > > 		...
> > > 		iommu_group_replace_domain();
> > > 		...
> > > 	}
> > > 	if (cur_hwpt && hwpt)
> > > 		list_del(&idev->devices_item);
> > > 	list_add(&idev->devices_item, &cur_hwpt->devices);
> > > 	...
> > > 	mutex_unlock(&hwpt->device_lock);
> > > 	if (cur_hwpt != hwpt)
> > > 		mutex_unlock(&cur_hwpt->device_lock);
> >
> > What is the issue? That isn't quite right, but the basic bit is fine
> >
> > If you want to do replace then you have to hold both devices_lock and
> > you write that super ugly thing like this
> >
> > lock_both:
> >    if (hwpt_a < hwpt_b) {
> >       mutex_lock(&hwpt_a->devices_lock);
> >       mutex_lock_nested(&hwpt_b->devices_lock);
> >    } else if (hwpt_a > hwpt_b) {
> >       mutex_lock(&hwpt_b->devices_lock);
> >       mutex_lock_nested(&hwpt_a->devices_lock);
> >    } else
> >       mutex_lock(&hwpt_a->devices_lock);
> >
> > And then it is trivial, yes?
> 
> Yea. That's your previous remark.
> 
> > Using the group_lock in the iommu core is the right way to fix
> > this.. Maybe someday we can do that.
> >
> > (also document that replace causes all the devices in the group to
> > change iommu_domains at once)
> 
> Yes. There's a discussion in PATCH-3 of this series. I drafted a
> patch changing iommu_attach/detach_dev():
> https://github.com/nicolinc/iommufd/commit/124f7804ef38d50490b606fd56c1e27ce551a839
> 
> Baolu had a similar patch series a year ago. So we might continue
> that effort in parallel, and eventually drop the device list/lock.
> 
> > > I just gave another thought about it. Since we have the patch-2
> > > from this series moving the ioas->mutex, it already serializes
> > > attach/detach routines. And I see that all the places touching
> > > idev->device_item and hwpt->devices are protected by ioas->mutex.
> > > So, perhaps we can simply remove the device_lock?
> >
> > The two hwpts are not required to have the same ioas, so this doesn't
> > really help..
> 
> Hmm...in that case, we should hold two ioas->mutex locks in
> addition to two device locks?

Aha, seems so. You can replace an s1 hwpt with another s1 hwpt which
has a different ioas. 😊 Maybe this is something incremental to the
nesting series. In the nesting series, the ioas and the s1 hwpt can
share the device_lock, can't they?

Regards,
Yi Liu
  

Patch

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 9f3b9674d72e..208757c39c90 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -212,7 +212,7 @@  static int iommufd_device_do_attach(struct iommufd_device *idev,
 				hwpt->domain->ops->enforce_cache_coherency(
 					hwpt->domain);
 		if (!hwpt->enforce_cache_coherency) {
-			WARN_ON(list_empty(&hwpt->devices));
+			WARN_ON(refcount_read(hwpt->devices_users) == 1);
 			rc = -EINVAL;
 			goto out_unlock;
 		}
@@ -236,7 +236,7 @@  static int iommufd_device_do_attach(struct iommufd_device *idev,
 		if (rc)
 			goto out_iova;
 
-		if (list_empty(&hwpt->devices)) {
+		if (refcount_read(hwpt->devices_users) == 1) {
 			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
 						   hwpt->domain);
 			if (rc)
@@ -246,6 +246,7 @@  static int iommufd_device_do_attach(struct iommufd_device *idev,
 
 	idev->hwpt = hwpt;
 	refcount_inc(&hwpt->obj.users);
+	refcount_inc(hwpt->devices_users);
 	list_add(&idev->devices_item, &hwpt->devices);
 	mutex_unlock(&hwpt->devices_lock);
 	return 0;
@@ -387,9 +388,10 @@  void iommufd_device_detach(struct iommufd_device *idev)
 
 	mutex_lock(&hwpt->ioas->mutex);
 	mutex_lock(&hwpt->devices_lock);
+	refcount_dec(hwpt->devices_users);
 	list_del(&idev->devices_item);
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		if (list_empty(&hwpt->devices)) {
+		if (refcount_read(hwpt->devices_users) == 1) {
 			iopt_table_remove_domain(&hwpt->ioas->iopt,
 						 hwpt->domain);
 			list_del(&hwpt->hwpt_item);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 43d473989a06..910e759ffeac 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -16,6 +16,8 @@  void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 	iommu_domain_free(hwpt->domain);
 	refcount_dec(&hwpt->ioas->obj.users);
 	mutex_destroy(&hwpt->devices_lock);
+	WARN_ON(!refcount_dec_if_one(hwpt->devices_users));
+	kfree(hwpt->devices_users);
 }
 
 /**
@@ -46,11 +48,20 @@  iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	INIT_LIST_HEAD(&hwpt->devices);
 	INIT_LIST_HEAD(&hwpt->hwpt_item);
 	mutex_init(&hwpt->devices_lock);
+	hwpt->devices_users = kzalloc(sizeof(*hwpt->devices_users), GFP_KERNEL);
+	if (!hwpt->devices_users) {
+		rc = -ENOMEM;
+		goto out_free_domain;
+	}
+	refcount_set(hwpt->devices_users, 1);
+
 	/* Pairs with iommufd_hw_pagetable_destroy() */
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
 	return hwpt;
 
+out_free_domain:
+	iommu_domain_free(hwpt->domain);
 out_abort:
 	iommufd_object_abort(ictx, &hwpt->obj);
 	return ERR_PTR(rc);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 222e86591f8a..f128d77fb076 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -247,6 +247,7 @@  struct iommufd_hw_pagetable {
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 	struct mutex devices_lock;
+	refcount_t *devices_users;
 	struct list_head devices;
 };