[3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration

Message ID 20221018145132.998866-4-schnelle@linux.ibm.com
State New
Series iommu/s390: Further improvements

Commit Message

Niklas Schnelle Oct. 18, 2022, 2:51 p.m. UTC
  The s390_domain->devices list is only added to when new devices are
attached but is iterated through in read-only fashion for every mapping
operation as well as for I/O TLB flushes, and thus in performance
critical code, causing contention on the s390_domain->list_lock.
Fortunately such a read-mostly linked list is a standard use case for
RCU. This change closely follows the example for RCU protected lists
given in Documentation/RCU/listRCU.rst.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h |  1 +
 arch/s390/pci/pci.c         |  2 +-
 drivers/iommu/s390-iommu.c  | 31 ++++++++++++++++---------------
 3 files changed, 18 insertions(+), 16 deletions(-)
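
For readers unfamiliar with the pattern the commit message refers to, the
read-mostly scheme from Documentation/RCU/listRCU.rst looks roughly like the
sketch below. It is a generic illustration only (struct item and the helpers
are made-up names, not driver code): writers still serialize against each
other with a spinlock, while readers traverse the list under rcu_read_lock()
without contending on that lock.

#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct item {
	struct list_head node;
	struct rcu_head rcu;
	int data;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);

/* update side: additions and removals are still serialized by the lock */
static void item_add(struct item *it)
{
	spin_lock(&item_lock);
	list_add_rcu(&it->node, &item_list);
	spin_unlock(&item_lock);
}

static void item_del(struct item *it)
{
	spin_lock(&item_lock);
	list_del_rcu(&it->node);
	spin_unlock(&item_lock);
	/* defer the free until all current readers have finished */
	kfree_rcu(it, rcu);
}

/* read side: runs concurrently with writers, no lock contention */
static int item_sum(void)
{
	struct item *it;
	int sum = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(it, &item_list, node)
		sum += it->data;
	rcu_read_unlock();
	return sum;
}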
  

Comments

Jason Gunthorpe Oct. 18, 2022, 3:18 p.m. UTC | #1
On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:

> @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
>  		return;
>  
>  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> -	list_del_init(&zdev->iommu_list);
> +	list_del_rcu(&zdev->iommu_list);
>  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

This doesn't seem obviously OK, the next steps remove the translation
while we can still have concurrent RCU protected flushes going on.

Is it OK to call the flushes after the zpci_dma_exit_device()/etc?

Jason
  
Niklas Schnelle Oct. 19, 2022, 8:31 a.m. UTC | #2
On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> 
> > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> >  		return;
> >  
> >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > -	list_del_init(&zdev->iommu_list);
> > +	list_del_rcu(&zdev->iommu_list);
> >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> 
> This doesn't seem obviously OK, the next steps remove the translation
> while we can still have concurrent RCU protected flushes going on.
> 
> Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> 
> Jason

Interesting point. So for the flushes themselves this should be fine,
once the zpci_unregister_ioat() is executed all subsequent and ongoing
IOTLB flushes should return an error code without further adverse
effects. Though I think we do still have an issue in the IOTLB ops for
this case as that error would skip the IOTLB flushes of other attached
devices.

The bigger question and that seems independent from RCU is how/if
detach is supposed to work if there are still DMAs ongoing. Once we do
the zpci_unregister_ioat() any DMA request coming from the PCI device
will be blocked and will lead to the PCI device being isolated (put
into an error state) for attempting an invalid DMA. So I had expected
that calls of detach/attach would happen without expected ongoing DMAs
and thus IOTLB flushes? Of course we should be robust against
violations of that and unexpected DMAs for which I think isolating the
PCI device is the correct response. What am I missing?
  
Jason Gunthorpe Oct. 19, 2022, 11:53 a.m. UTC | #3
On Wed, Oct 19, 2022 at 10:31:21AM +0200, Niklas Schnelle wrote:
> On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> > 
> > > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> > >  		return;
> > >  
> > >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > -	list_del_init(&zdev->iommu_list);
> > > +	list_del_rcu(&zdev->iommu_list);
> > >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > 
> > This doesn't seem obviously OK, the next steps remove the translation
> > while we can still have concurrent RCU protected flushes going on.
> > 
> > Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> > 
> > Jason
> 
> Interesting point. So for the flushes themselves this should be fine,
> once the zpci_unregister_ioat() is executed all subsequent and ongoing
> IOTLB flushes should return an error code without further adverse
> effects. Though I think we do still have an issue in the IOTLB ops for
> this case as that error would skip the IOTLB flushes of other attached
> devices.

That sounds bad


> The bigger question and that seems independent from RCU is how/if
> detach is supposed to work if there are still DMAs ongoing. Once we do
> the zpci_unregister_ioat() any DMA request coming from the PCI device
> will be blocked and will lead to the PCI device being isolated (put
> into an error state) for attempting an invalid DMA. So I had expected
> that calls of detach/attach would happen without expected ongoing DMAs
> and thus IOTLB flushes? 

"ongoing DMA" from this device shouuld be stopped, it doesn't mean
that the other devices attached to the same domain are not also still
operating and also still having flushes. So now that it is RCU a flush
triggered by a different device will continue to see this now disabled
device and try to flush it until the grace period.

Jason
  
Niklas Schnelle Oct. 20, 2022, 8:51 a.m. UTC | #4
On Wed, 2022-10-19 at 08:53 -0300, Jason Gunthorpe wrote:
> On Wed, Oct 19, 2022 at 10:31:21AM +0200, Niklas Schnelle wrote:
> > On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> > > 
> > > > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> > > >  		return;
> > > >  
> > > >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > > -	list_del_init(&zdev->iommu_list);
> > > > +	list_del_rcu(&zdev->iommu_list);
> > > >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > > 
> > > This doesn't seem obviously OK, the next steps remove the translation
> > > while we can still have concurrent RCU protected flushes going on.
> > > 
> > > Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> > > 
> > > Jason
> > 
> > Interesting point. So for the flushes themselves this should be fine,
> > once the zpci_unregister_ioat() is executed all subsequent and ongoing
> > IOTLB flushes should return an error code without further adverse
> > effects. Though I think we do still have an issue in the IOTLB ops for
> > this case as that error would skip the IOTLB flushes of other attached
> > devices.
> 
> That sounds bad

Thankfully it's very easy to fix since our IOTLB flushes are per PCI
function: I just need to continue the loop in the IOTLB ops on error
instead of breaking out of it and skipping the other devices. It makes no
sense anyway to skip devices just because there is an error on another
device.
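
A minimal sketch of that change, using the flush loop from the patch below
(the exact error handling may look different in the next version):

	rcu_read_lock();
	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
		/*
		 * Was "if (rc) break;": an error for one device (e.g. one
		 * that was concurrently detached) must not skip the flushes
		 * for the remaining attached devices.
		 */
		if (zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
				       zdev->end_dma - zdev->start_dma + 1))
			continue;
	}
	rcu_read_unlock();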

> 
> 
> > The bigger question and that seems independent from RCU is how/if
> > detach is supposed to work if there are still DMAs ongoing. Once we do
> > the zpci_unregister_ioat() any DMA request coming from the PCI device
> > will be blocked and will lead to the PCI device being isolated (put
> > into an error state) for attempting an invalid DMA. So I had expected
> > that calls of detach/attach would happen without expected ongoing DMAs
> > and thus IOTLB flushes? 
> 
> "ongoing DMA" from this device shouuld be stopped, it doesn't mean
> that the other devices attached to the same domain are not also still
> operating and also still having flushes. So now that it is RCU a flush
> triggered by a different device will continue to see this now disabled
> device and try to flush it until the grace period.
> 
> Jason

Ok that makes sense thanks for the explanation. So yes my assessment is
still that in this situation the IOTLB flush is architected to return
an error that we can ignore. Not the most elegant I admit but at least
it's simple. Alternatively I guess we could use call_rcu() to do the
zpci_unregister_ioat() but I'm not sure how to then make sure that a
subsequent zpci_register_ioat() only happens after that without adding
too much more logic.
  
Jason Gunthorpe Oct. 20, 2022, 11:05 a.m. UTC | #5
On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:

> Ok that makes sense thanks for the explanation. So yes my assessment is
> still that in this situation the IOTLB flush is architected to return
> an error that we can ignore. Not the most elegant I admit but at least
> it's simple. Alternatively I guess we could use call_rcu() to do the
> zpci_unregister_ioat() but I'm not sure how to then make sure that a
> subsequent zpci_register_ioat() only happens after that without adding
> too much more logic.

This won't work either as the domain could have been freed before the
call_rcu() happens, the domain needs to be detached synchronously

Jason
  
Niklas Schnelle Oct. 21, 2022, 12:08 p.m. UTC | #6
On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> 
> > Ok that makes sense thanks for the explanation. So yes my assessment is
> > still that in this situation the IOTLB flush is architected to return
> > an error that we can ignore. Not the most elegant I admit but at least
> > it's simple. Alternatively I guess we could use call_rcu() to do the
> > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > subsequent zpci_register_ioat() only happens after that without adding
> > too much more logic.
> 
> This won't work either as the domain could have been freed before the
> call_rcu() happens, the domain needs to be detached synchronously
> 
> Jason

Yeah right, that is basically the same issue I was thinking of for a
subsequent zpci_register_ioat(). What about the obvious one. Just call
synchronize_rcu() before zpci_unregister_ioat()?
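
For concreteness, that variant would look roughly like this in the detach
path from the patch below (a sketch only):

	spin_lock_irqsave(&s390_domain->list_lock, flags);
	list_del_rcu(&zdev->iommu_list);
	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

	/* wait until concurrent flush loops can no longer see zdev ... */
	synchronize_rcu();

	/* ... before its translation is torn down */
	zpci_unregister_ioat(zdev, 0);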
  
Jason Gunthorpe Oct. 21, 2022, 1:36 p.m. UTC | #7
On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > 
> > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > still that in this situation the IOTLB flush is architected to return
> > > an error that we can ignore. Not the most elegant I admit but at least
> > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > subsequent zpci_register_ioat() only happens after that without adding
> > > too much more logic.
> > 
> > This won't work either as the domain could have been freed before the
> > call_rcu() happens, the domain needs to be detached synchronously
> > 
> > Jason
> 
> Yeah right, that is basically the same issue I was thinking of for a
> subsequent zpci_register_ioat(). What about the obvious one. Just call
> synchronize_rcu() before zpci_unregister_ioat()?

Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
to complete in some cases.

What you have seems like it could be OK, just deal with the ugly racy
failure

Jason
  
Niklas Schnelle Oct. 21, 2022, 3:01 p.m. UTC | #8
On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > 
> > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > still that in this situation the IOTLB flush is architected to return
> > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > too much more logic.
> > > 
> > > This won't work either as the domain could have been freed before the
> > > call_rcu() happens, the domain needs to be detached synchronously
> > > 
> > > Jason
> > 
> > Yeah right, that is basically the same issue I was thinking of for a
> > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > synchronize_rcu() before zpci_unregister_ioat()?
> 
> Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> to complete in some cases.
> 
> What you have seems like it could be OK, just deal with the ugly racy
> failure
> 
> Jason

I'd tend to go with synchronize_rcu(). It won't leave us with spurious
error logs for the failed IOTLB flushes and as you said one expects
detach to be synchronous. I don't think waiting in it will be a
problem. But this is definitely something you're more of an expert on
so I'll trust your judgement. Looking at other callers of
synchronize_rcu() quite a few of them look to be in similar
detach/release kind of situations though not sure how frequent and
performance critical IOMMU domain detaching is in comparison.

Thanks,
Niklas
  
Jason Gunthorpe Oct. 21, 2022, 3:04 p.m. UTC | #9
On Fri, Oct 21, 2022 at 05:01:32PM +0200, Niklas Schnelle wrote:
> On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > still that in this situation the IOTLB flush is architected to return
> > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > too much more logic.
> > > > 
> > > > This won't work either as the domain could have been freed before the
> > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > 
> > > > Jason
> > > 
> > > Yeah right, that is basically the same issue I was thinking of for a
> > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > synchronize_rcu() before zpci_unregister_ioat()?
> > 
> > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > to complete in some cases.
> > 
> > What you have seems like it could be OK, just deal with the ugly racy
> > failure
> > 
> > Jason
> 
> I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> error logs for the failed IOTLB flushes and as you said one expects
> detach to be synchronous. I don't think waiting in it will be a
> problem. But this is definitely something you're more of an expert on
> so I'll trust your judgement. Looking at other callers of
> synchronize_rcu() quite a few of them look to be in similar
> detach/release kind of situations though not sure how frequent and
> performance critical IOMMU domain detaching is in comparison.

I would not do it on domain detaching, that is something triggered by
userspace through VFIO and it could theoretically happen a lot, e.g. in
vIOMMU scenarios.

Jason
  
Niklas Schnelle Oct. 21, 2022, 3:05 p.m. UTC | #10
On Fri, 2022-10-21 at 17:01 +0200, Niklas Schnelle wrote:
> On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > still that in this situation the IOTLB flush is architected to return
> > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > too much more logic.
> > > > 
> > > > This won't work either as the domain could have been freed before the
> > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > 
> > > > Jason
> > > 
> > > Yeah right, that is basically the same issue I was thinking of for a
> > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > synchronize_rcu() before zpci_unregister_ioat()?
> > 
> > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > to complete in some cases.
> > 
> > What you have seems like it could be OK, just deal with the ugly racy
> > failure
> > 
> > Jason
> 
> I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> error logs for the failed IOTLB flushes and as you said one expects
> detach to be synchronous. I don't think waiting in it will be a
> problem. But this is definitely something you're more of an expert on
> so I'll trust your judgement. Looking at other callers of
> synchronize_rcu() quite a few of them look to be in similar
> detach/release kind of situations though not sure how frequent and
> performance critical IOMMU domain detaching is in comparison.
> 
> Thanks,
> Niklas
> 

Addendum, of course independently of whether to use synchronize_rcu()
I'll change the error handling in the IOTLB ops to not skip over the
other devices.
  
Niklas Schnelle Oct. 24, 2022, 3:22 p.m. UTC | #11
On Fri, 2022-10-21 at 12:04 -0300, Jason Gunthorpe wrote:
> On Fri, Oct 21, 2022 at 05:01:32PM +0200, Niklas Schnelle wrote:
> > On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > > 
> > > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > > still that in this situation the IOTLB flush is architected to return
> > > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > > too much more logic.
> > > > > 
> > > > > This won't work either as the domain could have been freed before the
> > > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > > 
> > > > > Jason
> > > > 
> > > > Yeah right, that is basically the same issue I was thinking of for a
> > > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > > synchronize_rcu() before zpci_unregister_ioat()?
> > > 
> > > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > > to complete in some cases.
> > > 
> > > What you have seems like it could be OK, just deal with the ugly racy
> > > failure
> > > 
> > > Jason
> > 
> > I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> > error logs for the failed IOTLB flushes and as you said one expects
> > detach to be synchronous. I don't think waiting in it will be a
> > problem. But this is definitely something you're more of an expert on
> > so I'll trust your judgement. Looking at other callers of
> > synchronize_rcu() quite a few of them look to be in similar
> > detach/release kind of situations though not sure how frequent and
> > performance critical IOMMU domain detaching is in comparison.
> 
> I would not do it on domain detaching, that is something triggered by
> userspace through VFIO and it could theoritically happen alot, eg in
> vIOMMU scenarios.
> 
> Jason

Thanks for the explanation, still would like to grok this a bit more if
you don't mind. If I do read things correctly synchronize_rcu() should
run in the context of the VFIO ioctl in this case and shouldn't block
anything else in the kernel, correct? At least that's how I understand
the synchronize_rcu() comments and the fact that e.g.
net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
synchronize_rcu() and can be triggered from user-space too.

So we're
more worried about user-space getting slowed down rather than a Denial-
of-Service against other kernel tasks.
  
Jason Gunthorpe Oct. 24, 2022, 4:26 p.m. UTC | #12
On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:

> Thanks for the explanation, still would like to grok this a bit more if
> you don't mind. If I do read things correctly synchronize_rcu() should
> run in the conext of the VFIO ioctl in this case and shouldn't block
> anything else in the kernel, correct? At least that's how I understand
> the synchronize_rcu() comments and the fact that e.g.
> net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> synchronize_rcu() and can be triggered from user-space too.

Yes, but I wouldn't look in the kernel to understand if things are OK
 
> So we're
> more worried about user-space getting slowed down rather than a Denial-
> of-Service against other kernel tasks.

Yes, functionally it is OK, but for something like vfio with vIOMMU
you could be looking at several domains that have to be detached
sequentially and with grace periods > 1s you can reach multiple
seconds to complete something like a close() system call. Generally it
should be weighed carefully

Jason
  
Niklas Schnelle Oct. 27, 2022, 12:44 p.m. UTC | #13
On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> 
> > Thanks for the explanation, still would like to grok this a bit more if
> > you don't mind. If I do read things correctly synchronize_rcu() should
> > run in the conext of the VFIO ioctl in this case and shouldn't block
> > anything else in the kernel, correct? At least that's how I understand
> > the synchronize_rcu() comments and the fact that e.g.
> > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > synchronize_rcu() and can be triggered from user-space too.
> 
> Yes, but I wouldn't look in the kernel to understand if things are OK
>  
> > So we're
> > more worried about user-space getting slowed down rather than a Denial-
> > of-Service against other kernel tasks.
> 
> Yes, functionally it is OK, but for something like vfio with vIOMMU
> you could be looking at several domains that have to be detached
> sequentially and with grace periods > 1s you can reach multiple
> seconds to complete something like a close() system call. Generally it
> should be weighed carefully
> 
> Jason

Thanks for the detailed explanation. Then let's not put a
synchronize_rcu() in detach, as I said as long as the I/O translation
tables are there an IOTLB flush after zpci_unregister_ioat() should
result in an ignorable error. That said, I think if we don't have the
synchronize_rcu() in detach we need it in s390_domain_free() before
freeing the I/O translation tables.
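
Concretely, that would be roughly the following (a sketch only; the
list-empty WARN_ON from the patch below is omitted):

	static void s390_domain_free(struct iommu_domain *domain)
	{
		struct s390_domain *s390_domain = to_s390_domain(domain);

		/* wait for flush loops that may still be walking ->devices */
		synchronize_rcu();
		dma_cleanup_tables(s390_domain->dma_table);
		kfree(s390_domain);
	}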
  
Jason Gunthorpe Oct. 27, 2022, 12:56 p.m. UTC | #14
On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > 
> > > Thanks for the explanation, still would like to grok this a bit more if
> > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > run in the conext of the VFIO ioctl in this case and shouldn't block
> > > anything else in the kernel, correct? At least that's how I understand
> > > the synchronize_rcu() comments and the fact that e.g.
> > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > synchronize_rcu() and can be triggered from user-space too.
> > 
> > Yes, but I wouldn't look in the kernel to understand if things are OK
> >  
> > > So we're
> > > more worried about user-space getting slowed down rather than a Denial-
> > > of-Service against other kernel tasks.
> > 
> > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > you could be looking at several domains that have to be detached
> > sequentially and with grace periods > 1s you can reach multiple
> > seconds to complete something like a close() system call. Generally it
> > should be weighed carefully
> > 
> > Jason
> 
> Thanks for the detailed explanation. Then let's not put a
> synchronize_rcu() in detach, as I said as long as the I/O translation
> tables are there an IOTLB flush after zpci_unregister_ioat() should
> result in an ignorable error. That said, I think if we don't have the
> synchronize_rcu() in detach we need it in s390_domain_free() before
> freeing the I/O translation tables.

Yes, it would be appropriate to free those using one of the rcu
free'rs (e.g. kfree_rcu()), not synchronize_rcu()

Jason
  
Niklas Schnelle Oct. 27, 2022, 1:35 p.m. UTC | #15
On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > 
> > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > run in the conext of the VFIO ioctl in this case and shouldn't block
> > > > anything else in the kernel, correct? At least that's how I understand
> > > > the synchronize_rcu() comments and the fact that e.g.
> > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > synchronize_rcu() and can be triggered from user-space too.
> > > 
> > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > >  
> > > > So we're
> > > > more worried about user-space getting slowed down rather than a Denial-
> > > > of-Service against other kernel tasks.
> > > 
> > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > you could be looking at several domains that have to be detached
> > > sequentially and with grace periods > 1s you can reach multiple
> > > seconds to complete something like a close() system call. Generally it
> > > should be weighed carefully
> > > 
> > > Jason
> > 
> > Thanks for the detailed explanation. Then let's not put a
> > synchronize_rcu() in detach, as I said as long as the I/O translation
> > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > result in an ignorable error. That said, I think if we don't have the
> > synchronize_rcu() in detach we need it in s390_domain_free() before
> > freeing the I/O translation tables.
> 
> Yes, it would be appropriate to free those using one of the rcu
> free'rs, (eg kfree_rcu) not synchronize_rcu()
> 
> Jason

They are allocated via kmem_cache_alloc() from caches shared by all
IOMMUs, so we can't use kfree_rcu() directly. Also we're only freeing the
entire I/O translation table of one IOMMU at once after it is not used
anymore. Before that it is only grown. So I think synchronize_rcu() is
the obvious and simple choice since we only need one grace period.
  
Jason Gunthorpe Oct. 27, 2022, 2:03 p.m. UTC | #16
On Thu, Oct 27, 2022 at 03:35:57PM +0200, Niklas Schnelle wrote:
> On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > > run in the conext of the VFIO ioctl in this case and shouldn't block
> > > > > anything else in the kernel, correct? At least that's how I understand
> > > > > the synchronize_rcu() comments and the fact that e.g.
> > > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > > synchronize_rcu() and can be triggered from user-space too.
> > > > 
> > > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > > >  
> > > > > So we're
> > > > > more worried about user-space getting slowed down rather than a Denial-
> > > > > of-Service against other kernel tasks.
> > > > 
> > > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > > you could be looking at several domains that have to be detached
> > > > sequentially and with grace periods > 1s you can reach multiple
> > > > seconds to complete something like a close() system call. Generally it
> > > > should be weighed carefully
> > > > 
> > > > Jason
> > > 
> > > Thanks for the detailed explanation. Then let's not put a
> > > synchronize_rcu() in detach, as I said as long as the I/O translation
> > > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > > result in an ignorable error. That said, I think if we don't have the
> > > synchronize_rcu() in detach we need it in s390_domain_free() before
> > > freeing the I/O translation tables.
> > 
> > Yes, it would be appropriate to free those using one of the rcu
> > free'rs, (eg kfree_rcu) not synchronize_rcu()
> > 
> > Jason
> 
> They are allocated via kmem_cache_alloc() from caches shared by all
> IOMMU's so can't use kfree_rcu() directly. Also we're only freeing the
> entire I/O translation table of one IOMMU at once after it is not used
> anymore. Before that it is only grown. So I think synchronize_rcu() is
> the obvious and simple choice since we only need one grace period.

It has the same issue as doing it for the other reason, adding
synchronize_rcu() to the domain free path is undesirable.

The best thing is to do as kfree_rcu() does now, basically:

rcu_head = kzalloc(sizeof(*rcu_head), GFP_NOWAIT | __GFP_NOWARN);
if (!rcu_head)
   synchronize_rcu();
else
   call_rcu(rcu_head, free_cb);

And then call kmem_cache_free() from the rcu callback

But this is getting very complicated, you might be better to refcount
the domain itself and acquire the refcount under RCU. This turns the
locking problem into a per-domain-object lock instead of a global lock
which is usually good enough and simpler to understand.

Jason
  
Niklas Schnelle Oct. 28, 2022, 9:29 a.m. UTC | #17
On Thu, 2022-10-27 at 11:03 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 27, 2022 at 03:35:57PM +0200, Niklas Schnelle wrote:
> > On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > > > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > > > 
> > > > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > > > run in the conext of the VFIO ioctl in this case and shouldn't block
> > > > > > anything else in the kernel, correct? At least that's how I understand
> > > > > > the synchronize_rcu() comments and the fact that e.g.
> > > > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > > > synchronize_rcu() and can be triggered from user-space too.
> > > > > 
> > > > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > > > >  
> > > > > > So we're
> > > > > > more worried about user-space getting slowed down rather than a Denial-
> > > > > > of-Service against other kernel tasks.
> > > > > 
> > > > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > > > you could be looking at several domains that have to be detached
> > > > > sequentially and with grace periods > 1s you can reach multiple
> > > > > seconds to complete something like a close() system call. Generally it
> > > > > should be weighed carefully
> > > > > 
> > > > > Jason
> > > > 
> > > > Thanks for the detailed explanation. Then let's not put a
> > > > synchronize_rcu() in detach, as I said as long as the I/O translation
> > > > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > > > result in an ignorable error. That said, I think if we don't have the
> > > > synchronize_rcu() in detach we need it in s390_domain_free() before
> > > > freeing the I/O translation tables.
> > > 
> > > Yes, it would be appropriate to free those using one of the rcu
> > > free'rs, (eg kfree_rcu) not synchronize_rcu()
> > > 
> > > Jason
> > 
> > They are allocated via kmem_cache_alloc() from caches shared by all
> > IOMMU's so can't use kfree_rcu() directly. Also we're only freeing the
> > entire I/O translation table of one IOMMU at once after it is not used
> > anymore. Before that it is only grown. So I think synchronize_rcu() is
> > the obvious and simple choice since we only need one grace period.
> 
> It has the same issue as doing it for the other reason, adding
> synchronize_rcu() to the domain free path is undesirable.
> 
> The best thing is to do as kfree_rcu() does now, basically:
> 
> rcu_head = kzalloc(rcu_head, GFP_NOWAIT, GFP_NOWARN)
> if (!rcu_head)
>    synchronize_rcu()
> else
>    call_rcu(rcu_head)
> 
> And then call kmem_cache_free() from the rcu callback

Hmm, maybe a stupid question but why can't I just put the rcu_head in
struct s390_domain and then do a call_rcu() on that with a callback
that does:

	dma_cleanup_tables(s390_domain->dma_table);
	kfree(s390_domain);

I.e. the rest of the current s390_domain_free().
Then I don't have to worry about failing to allocate the rcu_head and
it's simple enough. Basically just do the actual freeing of the
s390_domain via call_rcu().
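
As a sketch, assuming a new rcu_head member is added to struct s390_domain
(the struct is shown abridged, and s390_domain_free_rcu() is an illustrative
name):

	struct s390_domain {
		struct iommu_domain	domain;
		unsigned long		*dma_table;
		/* ... existing members ... */
		struct rcu_head		rcu;	/* new */
	};

	static void s390_domain_free_rcu(struct rcu_head *rcu)
	{
		struct s390_domain *s390_domain =
			container_of(rcu, struct s390_domain, rcu);

		dma_cleanup_tables(s390_domain->dma_table);
		kfree(s390_domain);
	}

	static void s390_domain_free(struct iommu_domain *domain)
	{
		struct s390_domain *s390_domain = to_s390_domain(domain);

		/* readers may still walk ->devices; free after a grace period */
		call_rcu(&s390_domain->rcu, s390_domain_free_rcu);
	}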

> 
> But this is getting very complicated, you might be better to refcount
> the domain itself and acquire the refcount under RCU. This turns the
> locking problem into a per-domain-object lock instead of a global lock
> which is usually good enough and simpler to understand.
> 
> Jason

Sorry I might be a bit slow as I'm new to RCU but I don't understand
this yet, especially the last part. Before this patch we do have a per-
domain lock but I'm sure that's not the kind of "per-domain-object
lock" you're talking about or else we wouldn't need RCU at all. Is this
maybe a different way of expressing the above idea using the analogy
with reference counting from whatisRCU.rst? Meaning we treat the fact
that there may still be RCU readers as "there are still references to
s390_domain"? 

Or do you mean to use a kref that is taken by RCU readers together with
rcu_read_lock() and dropped at rcu_read_unlock() such that during the
RCU read critical sections the refcount can't fall below 1 and the
domain is actually freed once we have a) put the initial reference
during s390_domain_free() and b) put all temporary references on
exiting the RCU read critical sections?
  
Jason Gunthorpe Oct. 28, 2022, 11:28 a.m. UTC | #18
On Fri, Oct 28, 2022 at 11:29:00AM +0200, Niklas Schnelle wrote:

> > rcu_head = kzalloc(rcu_head, GFP_NOWAIT, GFP_NOWARN)
> > if (!rcu_head)
> >    synchronize_rcu()
> > else
> >    call_rcu(rcu_head)
> > 
> > And then call kmem_cache_free() from the rcu callback
> 
> Hmm, maybe a stupid question but why can't I just put the rcu_head in
> struct s390_domain and then do a call_rcu() on that with a callback
> that does:
> 
> 	dma_cleanup_tables(s390_domain->dma_table);
> 	kfree(s390_domain);
> 
> I.e. the rest of the current s390_domain_free().
> Then I don't have to worry about failing to allocate the rcu_head and
> it's simple enough. Basically just do the actual freeing of the
> s390_domain via call_rcu().

Oh, if you never reallocate the dma_table then yes that is a good idea

> Or do you mean to use a kref that is taken by RCU readers together with
> rcu_read_lock() and dropped at rcu_read_unlock() such that during the
> RCU read critical sections the refcount can't fall below 1 and the
> domain is actually freed once we have a) put the initial reference
> during s390_domain_free() and b) put all temporary references on
> exiting the RCU read critical sections?

Yes, this is a common pattern. Usually you want to optimize away the
global lock that protects, say, a linked list and then accept a local
lock/refcount inside the object

Jason
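
For reference, the "refcount taken under RCU" pattern described here usually
looks like the generic sketch below; struct obj and the helpers are
illustrative names, not code from this series.

#include <linux/kref.h>
#include <linux/rculist.h>
#include <linux/slab.h>

struct obj {
	struct kref ref;
	struct rcu_head rcu;
	struct list_head node;
};

/* reader: look up an object and pin it so it outlives rcu_read_unlock() */
static struct obj *obj_get(struct list_head *head)
{
	struct obj *o, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(o, head, node) {
		/* fails if the object is already on its way to being freed */
		if (kref_get_unless_zero(&o->ref)) {
			found = o;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}

static void obj_release(struct kref *ref)
{
	struct obj *o = container_of(ref, struct obj, ref);

	kfree_rcu(o, rcu);	/* actual free waits for a grace period */
}

/* callers drop their reference with kref_put(&o->ref, obj_release) */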
  

Patch

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 07361e2fd8c5..e4c3e4e04d30 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -119,6 +119,7 @@  struct zpci_dev {
 	struct list_head entry;		/* list of all zpci_devices, needed for hotplug, etc. */
 	struct list_head iommu_list;
 	struct kref kref;
+	struct rcu_head rcu;
 	struct hotplug_slot hotplug_slot;
 
 	enum zpci_state state;
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index a703dcd94a68..ef38b1514c77 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -996,7 +996,7 @@  void zpci_release_device(struct kref *kref)
 		break;
 	}
 	zpci_dbg(3, "rem fid:%x\n", zdev->fid);
-	kfree(zdev);
+	kfree_rcu(zdev, rcu);
 }
 
 int zpci_report_error(struct pci_dev *pdev,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index a4c2e9bc6d83..4e90987be387 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -10,6 +10,8 @@ 
 #include <linux/iommu.h>
 #include <linux/iommu-helper.h>
 #include <linux/sizes.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
 #include <asm/pci_dma.h>
 
 static const struct iommu_ops s390_iommu_ops;
@@ -61,7 +63,7 @@  static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 
 	spin_lock_init(&s390_domain->dma_table_lock);
 	spin_lock_init(&s390_domain->list_lock);
-	INIT_LIST_HEAD(&s390_domain->devices);
+	INIT_LIST_HEAD_RCU(&s390_domain->devices);
 
 	return &s390_domain->domain;
 }
@@ -70,7 +72,9 @@  static void s390_domain_free(struct iommu_domain *domain)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 
+	rcu_read_lock();
 	WARN_ON(!list_empty(&s390_domain->devices));
+	rcu_read_unlock();
 	dma_cleanup_tables(s390_domain->dma_table);
 	kfree(s390_domain);
 }
@@ -84,7 +88,7 @@  static void __s390_iommu_detach_device(struct zpci_dev *zdev)
 		return;
 
 	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_del_init(&zdev->iommu_list);
+	list_del_rcu(&zdev->iommu_list);
 	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
 
 	zpci_unregister_ioat(zdev, 0);
@@ -127,7 +131,7 @@  static int s390_iommu_attach_device(struct iommu_domain *domain,
 	zdev->s390_domain = s390_domain;
 
 	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_add(&zdev->iommu_list, &s390_domain->devices);
+	list_add_rcu(&zdev->iommu_list, &s390_domain->devices);
 	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
 
 	return 0;
@@ -203,17 +207,16 @@  static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
 					zdev->end_dma - zdev->start_dma + 1);
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
@@ -222,21 +225,20 @@  static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	size_t size = gather->end - gather->start + 1;
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
 	/* If gather was never added to there is nothing to flush */
 	if (gather->start == ULONG_MAX)
 		return;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
 					size);
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
@@ -244,11 +246,10 @@  static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		if (!zdev->tlb_refresh)
 			continue;
 		rc = zpci_refresh_trans((u64)zdev->fh << 32,
@@ -256,7 +257,7 @@  static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static int s390_iommu_update_trans(struct s390_domain *s390_domain,