diff mbox series

[v7,12/12] iommu: Improve iopf_queue_flush_dev()

Message ID	20231115030226.16700-13-baolu.lu@linux.intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; From: Lu Baolu <baolu.lu@linux.intel.com> To: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, Jason Gunthorpe <jgg@ziepe.ca>, Kevin Tian <kevin.tian@intel.com>, Jean-Philippe Brucker <jean-philippe@linaro.org>, Nicolin Chen <nicolinc@nvidia.com> Cc: Yi Liu <yi.l.liu@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>, Yan Zhao <yan.y.zhao@intel.com>, iommu@lists.linux.dev, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Lu Baolu <baolu.lu@linux.intel.com> Subject: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev() Date: Wed, 15 Nov 2023 11:02:26 +0800 Message-Id: <20231115030226.16700-13-baolu.lu@linux.intel.com> In-Reply-To: <20231115030226.16700-1-baolu.lu@linux.intel.com> References: <20231115030226.16700-1-baolu.lu@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	iommu: Prepare to deliver page faults to user space \| [v7,00/12] iommu: Prepare to deliver page faults to user space [v7,01/12] iommu: Move iommu fault data to linux/iommu.h [v7,02/12] iommu/arm-smmu-v3: Remove unrecoverable faults reporting [v7,03/12] iommu: Remove unrecoverable fault data [v7,04/12] iommu: Cleanup iopf data structure definitions [v7,05/12] iommu: Merge iopf_device_param into iommu_fault_param [v7,06/12] iommu: Remove iommu_[un]register_device_fault_handler() [v7,07/12] iommu: Merge iommu_fault_event and iopf_fault [v7,08/12] iommu: Prepare for separating SVA and IOPF [v7,09/12] iommu: Make iommu_queue_iopf() more generic [v7,10/12] iommu: Separate SVA and IOPF [v7,11/12] iommu: Consolidate per-device fault data management [v7,12/12] iommu: Improve iopf_queue_flush_dev()

Commit Message

Baolu Lu Nov. 15, 2023, 3:02 a.m. UTC

  The iopf_queue_flush_dev() is called by the iommu driver before releasing
a PASID. It ensures that all pending faults for this PASID have been
handled or cancelled, and won't hit the address space that reuses this
PASID. The driver must make sure that no new fault is added to the queue.

The SMMUv3 driver doesn't use it because it only implements the
Arm-specific stall fault model where DMA transactions are held in the SMMU
while waiting for the OS to handle iopf's. Since a device driver must
complete all DMA transactions before detaching domain, there are no
pending iopf's with the stall model. PRI support requires adding a call to
iopf_queue_flush_dev() after flushing the hardware page fault queue.

The current implementation of iopf_queue_flush_dev() is a simplified
version. It is only suitable for SVA case in which the processing of iopf
is implemented in the inner loop of the iommu subsystem.

Improve this interface to make it also work for handling iopf out of the
iommu core. Rename the function with a more meaningful name. Remove a
warning message in iommu_page_response() since the iopf queue might get
flushed before possible pending responses.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
---
 include/linux/iommu.h      |  4 +--
 drivers/iommu/intel/svm.c  |  2 +-
 drivers/iommu/io-pgfault.c | 60 ++++++++++++++++++++++++++++++--------
 3 files changed, 51 insertions(+), 15 deletions(-)

Comments

Jason Gunthorpe Dec. 1, 2023, 8:35 p.m. UTC | #1

On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
> The iopf_queue_flush_dev() is called by the iommu driver before releasing
> a PASID. It ensures that all pending faults for this PASID have been
> handled or cancelled, and won't hit the address space that reuses this
> PASID. The driver must make sure that no new fault is added to the queue.

This needs more explanation, why should anyone care?

More importantly, why is *discarding* the right thing to do?
Especially why would we discard a partial page request group?

After we change a translation we may have PRI requests in a
queue. They need to be acknowledged, not discarded. The DMA in the
device should be restarted and the device should observe the new
translation - if it is blocking then it should take a DMA error.

More broadly, we should just let things run their normal course. The
domain to deliver the fault to should be determined very early. If we
get a fault and there is no fault domain currently assigned then just
restart it.

The main reason to fence would be to allow the domain to become freed
as the faults should be holding pointers to it. But I feel there are
simpler options for that then this..

> The SMMUv3 driver doesn't use it because it only implements the
> Arm-specific stall fault model where DMA transactions are held in the SMMU
> while waiting for the OS to handle iopf's. Since a device driver must
> complete all DMA transactions before detaching domain, there are no
> pending iopf's with the stall model. PRI support requires adding a call to
> iopf_queue_flush_dev() after flushing the hardware page fault queue.

This explanation doesn't make much sense, from a device driver
perspective both PRI and stall cause the device to not complete DMAs.

The difference between stall and PRI is fairly small, stall causes an
internal bus to lock up while PRI does not.

> -int iopf_queue_flush_dev(struct device *dev)
> +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
>  {
>  	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
> +	const struct iommu_ops *ops = dev_iommu_ops(dev);
> +	struct iommu_page_response resp;
> +	struct iopf_fault *iopf, *next;
> +	int ret = 0;
>  
>  	if (!iopf_param)
>  		return -ENODEV;
>  
>  	flush_workqueue(iopf_param->queue->wq);
> +

A naked flush_workqueue like this is really suspicious, it needs a
comment explaining why the queue can't get more work queued at this
point. 

I suppose the driver is expected to stop calling
iommu_report_device_fault() before calling this function, but that
doesn't seem like it is going to be possible. Drivers should be
implementing atomic replace for the PASID updates and in that case
there is no momement when it can say the HW will stop generating PRI.

I'm looking at this code after these patches are applied and it still
seems quite bonkers to me :(

Why do we allocate two copies of the memory on all fault paths?

Why do we have fault->type still that only has one value?

What is serializing iommu_get_domain_for_dev_pasid() in the fault
path? It looks sort of like the plan is to use iopf_param->lock and
ensure domain removal grabs that lock at least after the xarray is
changed - but does that actually happen?

I would suggest, broadly, a flow for iommu_report_device_fault() sort
of:

1) Allocate memory for the evt. Every path except errors needs this,
   so just do it
2) iopf_get_dev_fault_param() should not have locks in it! This is
   fast path now. Use a refcount, atomic compare exchange to allocate,
   and RCU free.
3) Everything runs under the fault_param->lock
4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then
   exit! This logic is really tortured and confusing
5) Allocate memory and assemble the group
6) Obtain the domain for this group and incr a per-domain counter that a
   fault is pending on that domain
7) Put the *group* into the WQ. Put the *group* on a list in fault_param
   instead of the individual faults
8) Don't linear search a linked list in iommu_page_response()! Pass
   the group in that we got from the WQ that we *know* is still
   active. Ack that passed group.

When freeing a domain wait for the per-domain counter to go to
zero. This ensures that the WQ is flushed out and all the outside
domain references are gone.

When wanting to turn off PRI make sure a non-PRI domain is
attached to everything. Fence against the HW's event queue. No new
iommu_report_device_fault() is possible.

Lock the fault_param->lock and go through every pending group and
respond it. Mark the group memory as invalid so iommu_page_response()
NOP's it. Unlock, fence the HW against queued responses, and turn off
PRI.

An *optimization* would be to lightly flush the domain when changing
the translation. Lock the fault_param->lock and look for groups in the
list with old_domain.  Do the same as for PRI-off: respond to the
group, mark it as NOP. The WQ may still be chewing on something so the
domain free still has to check and wait.

Did I get it right??

Jason

Baolu Lu Dec. 3, 2023, 8:53 a.m. UTC | #2

On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
> On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
>> The iopf_queue_flush_dev() is called by the iommu driver before releasing
>> a PASID. It ensures that all pending faults for this PASID have been
>> handled or cancelled, and won't hit the address space that reuses this
>> PASID. The driver must make sure that no new fault is added to the queue.
> This needs more explanation, why should anyone care?
> 
> More importantly, why is*discarding*  the right thing to do?
> Especially why would we discard a partial page request group?
> 
> After we change a translation we may have PRI requests in a
> queue. They need to be acknowledged, not discarded. The DMA in the
> device should be restarted and the device should observe the new
> translation - if it is blocking then it should take a DMA error.
> 
> More broadly, we should just let things run their normal course. The
> domain to deliver the fault to should be determined very early. If we
> get a fault and there is no fault domain currently assigned then just
> restart it.
> 
> The main reason to fence would be to allow the domain to become freed
> as the faults should be holding pointers to it. But I feel there are
> simpler options for that then this..

In the iommu_detach_device_pasid() path, the domain is about to be
removed from the pasid of device. The IOMMU driver performs the
following steps sequentially:

1. Clears the pasid translation entry. Thus, all subsequent DMA
    transactions (translation requests, translated requests or page
    requests) targeting the iommu domain will be blocked.

2. Waits until all pending page requests for the device's PASID have
    been reported to upper layers via the iommu_report_device_fault().
    However, this does not guarantee that all page requests have been
    responded.

3. Free all partial page requests for this pasid since the page request
    response is only needed for a complete request group. There's no
    action required for the page requests which are not last of a request
    group.

4. Iterate through the list of pending page requests and identifies
    those originating from the device's PASID. For each identified
    request, the driver responds to the hardware with the
    IOMMU_PAGE_RESP_INVALID code, indicating that the request cannot be
    handled and retries should not be attempted. This response code
    corresponds to the "Invalid Request" status defined in the PCI PRI
    specification.

5. Follow the IOMMU hardware requirements (for example, VT-d sepc,
    section 7.10, Software Steps to Drain Page Requests & Responses) to
    drain in-flight page requests and page group responses between the
    remapping hardware queues and the endpoint device.

With above steps done in iommu_detach_device_pasid(), the pasid could be
re-used for any other address space.

The iopf_queue_discard_dev_pasid() helper does step 3 and 4.

> 
>> The SMMUv3 driver doesn't use it because it only implements the
>> Arm-specific stall fault model where DMA transactions are held in the SMMU
>> while waiting for the OS to handle iopf's. Since a device driver must
>> complete all DMA transactions before detaching domain, there are no
>> pending iopf's with the stall model. PRI support requires adding a call to
>> iopf_queue_flush_dev() after flushing the hardware page fault queue.
> This explanation doesn't make much sense, from a device driver
> perspective both PRI and stall cause the device to not complete DMAs.
> 
> The difference between stall and PRI is fairly small, stall causes an
> internal bus to lock up while PRI does not.
> 
>> -int iopf_queue_flush_dev(struct device *dev)
>> +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
>>   {
>>   	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
>> +	const struct iommu_ops *ops = dev_iommu_ops(dev);
>> +	struct iommu_page_response resp;
>> +	struct iopf_fault *iopf, *next;
>> +	int ret = 0;
>>   
>>   	if (!iopf_param)
>>   		return -ENODEV;
>>   
>>   	flush_workqueue(iopf_param->queue->wq);
>> +
> A naked flush_workqueue like this is really suspicious, it needs a
> comment explaining why the queue can't get more work queued at this
> point.
> 
> I suppose the driver is expected to stop calling
> iommu_report_device_fault() before calling this function, but that
> doesn't seem like it is going to be possible. Drivers should be
> implementing atomic replace for the PASID updates and in that case
> there is no momement when it can say the HW will stop generating PRI.

Atomic domain replacement for a PASID is not currently implemented in
the core or driver. Even if atomic replacement were to be implemented,
it would be necessary to ensure that all translation requests,
translated requests, page requests and responses for the old domain are
drained before switching to the new domain. I am not sure whether the
existing iommu hardware architecture supports this functionality.

Best regards,
baolu

Jason Gunthorpe Dec. 3, 2023, 2:14 p.m. UTC | #3

On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
> > > The iopf_queue_flush_dev() is called by the iommu driver before releasing
> > > a PASID. It ensures that all pending faults for this PASID have been
> > > handled or cancelled, and won't hit the address space that reuses this
> > > PASID. The driver must make sure that no new fault is added to the queue.
> > This needs more explanation, why should anyone care?
> > 
> > More importantly, why is*discarding*  the right thing to do?
> > Especially why would we discard a partial page request group?
> > 
> > After we change a translation we may have PRI requests in a
> > queue. They need to be acknowledged, not discarded. The DMA in the
> > device should be restarted and the device should observe the new
> > translation - if it is blocking then it should take a DMA error.
> > 
> > More broadly, we should just let things run their normal course. The
> > domain to deliver the fault to should be determined very early. If we
> > get a fault and there is no fault domain currently assigned then just
> > restart it.
> > 
> > The main reason to fence would be to allow the domain to become freed
> > as the faults should be holding pointers to it. But I feel there are
> > simpler options for that then this..
> 
> In the iommu_detach_device_pasid() path, the domain is about to be
> removed from the pasid of device. The IOMMU driver performs the
> following steps sequentially:

I know that is why it does, but it doesn't explain at all why.

> 1. Clears the pasid translation entry. Thus, all subsequent DMA
>    transactions (translation requests, translated requests or page
>    requests) targeting the iommu domain will be blocked.
> 
> 2. Waits until all pending page requests for the device's PASID have
>    been reported to upper layers via the iommu_report_device_fault().
>    However, this does not guarantee that all page requests have been
>    responded.
>
> 3. Free all partial page requests for this pasid since the page request
>    response is only needed for a complete request group. There's no
>    action required for the page requests which are not last of a request
>    group.

But we expect the last to come eventually since everything should be
grouped properly, so why bother doing this?

Indeed if 2 worked, how is this even possible to have partials?
 
> 5. Follow the IOMMU hardware requirements (for example, VT-d sepc,
>    section 7.10, Software Steps to Drain Page Requests & Responses) to
>    drain in-flight page requests and page group responses between the
>    remapping hardware queues and the endpoint device.
> 
> With above steps done in iommu_detach_device_pasid(), the pasid could be
> re-used for any other address space.

As I said, that isn't even required. There is no issue with leaking
PRI's across attachments.


> > I suppose the driver is expected to stop calling
> > iommu_report_device_fault() before calling this function, but that
> > doesn't seem like it is going to be possible. Drivers should be
> > implementing atomic replace for the PASID updates and in that case
> > there is no momement when it can say the HW will stop generating PRI.
> 
> Atomic domain replacement for a PASID is not currently implemented in
> the core or driver. 

It is, the driver should implement set_dev_pasid in such a way that
repeated calls do replacements, ideally atomically. This is what ARM
SMMUv3 does after my changes.

> Even if atomic replacement were to be implemented,
> it would be necessary to ensure that all translation requests,
> translated requests, page requests and responses for the old domain are
> drained before switching to the new domain. 

Again, no it isn't required.

Requests simply have to continue to be acked, it doesn't matter if
they are acked against the wrong domain because the device will simply
re-issue them..

Jason

Baolu Lu Dec. 4, 2023, 1:32 a.m. UTC | #4

On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
>> On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
>>> On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
>>>> The iopf_queue_flush_dev() is called by the iommu driver before releasing
>>>> a PASID. It ensures that all pending faults for this PASID have been
>>>> handled or cancelled, and won't hit the address space that reuses this
>>>> PASID. The driver must make sure that no new fault is added to the queue.
>>> This needs more explanation, why should anyone care?
>>>
>>> More importantly, why is*discarding*  the right thing to do?
>>> Especially why would we discard a partial page request group?
>>>
>>> After we change a translation we may have PRI requests in a
>>> queue. They need to be acknowledged, not discarded. The DMA in the
>>> device should be restarted and the device should observe the new
>>> translation - if it is blocking then it should take a DMA error.
>>>
>>> More broadly, we should just let things run their normal course. The
>>> domain to deliver the fault to should be determined very early. If we
>>> get a fault and there is no fault domain currently assigned then just
>>> restart it.
>>>
>>> The main reason to fence would be to allow the domain to become freed
>>> as the faults should be holding pointers to it. But I feel there are
>>> simpler options for that then this..
>>
>> In the iommu_detach_device_pasid() path, the domain is about to be
>> removed from the pasid of device. The IOMMU driver performs the
>> following steps sequentially:
> 
> I know that is why it does, but it doesn't explain at all why.
> 
>> 1. Clears the pasid translation entry. Thus, all subsequent DMA
>>     transactions (translation requests, translated requests or page
>>     requests) targeting the iommu domain will be blocked.
>>
>> 2. Waits until all pending page requests for the device's PASID have
>>     been reported to upper layers via the iommu_report_device_fault().
>>     However, this does not guarantee that all page requests have been
>>     responded.
>>
>> 3. Free all partial page requests for this pasid since the page request
>>     response is only needed for a complete request group. There's no
>>     action required for the page requests which are not last of a request
>>     group.
> 
> But we expect the last to come eventually since everything should be
> grouped properly, so why bother doing this?
> 
> Indeed if 2 worked, how is this even possible to have partials?

Step 1 clears the pasid table entry, hence all subsequent page requests
are blocked (hardware auto-respond the request but not put it in the
queue).

It is possible that a portion of a page fault group may have been queued
for processing, but the last request is being blocked by hardware due
to the pasid entry being in the blocking state.

In reality, this may be a no-op as I haven't seen any real-world
implementations of multiple-requests fault groups on Intel platforms.

>   
>> 5. Follow the IOMMU hardware requirements (for example, VT-d sepc,
>>     section 7.10, Software Steps to Drain Page Requests & Responses) to
>>     drain in-flight page requests and page group responses between the
>>     remapping hardware queues and the endpoint device.
>>
>> With above steps done in iommu_detach_device_pasid(), the pasid could be
>> re-used for any other address space.
> 
> As I said, that isn't even required. There is no issue with leaking
> PRI's across attachments.
> 
> 
>>> I suppose the driver is expected to stop calling
>>> iommu_report_device_fault() before calling this function, but that
>>> doesn't seem like it is going to be possible. Drivers should be
>>> implementing atomic replace for the PASID updates and in that case
>>> there is no momement when it can say the HW will stop generating PRI.
>>
>> Atomic domain replacement for a PASID is not currently implemented in
>> the core or driver.
> 
> It is, the driver should implement set_dev_pasid in such a way that
> repeated calls do replacements, ideally atomically. This is what ARM
> SMMUv3 does after my changes.
> 
>> Even if atomic replacement were to be implemented,
>> it would be necessary to ensure that all translation requests,
>> translated requests, page requests and responses for the old domain are
>> drained before switching to the new domain.
> 
> Again, no it isn't required.
> 
> Requests simply have to continue to be acked, it doesn't matter if
> they are acked against the wrong domain because the device will simply
> re-issue them..

Ah! I start to get your point now.

Even a page fault response is postponed to a new address space, which
possibly be another address space or hardware blocking state, the
hardware just retries.

As long as we flushes all caches (IOTLB and device TLB) during 
switching, the mappings of the old domain won't leak. So it's safe to 
keep page requests there.

Do I get you correctly?

Best regards,
baolu

Baolu Lu Dec. 4, 2023, 3:46 a.m. UTC | #5

On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
> I'm looking at this code after these patches are applied and it still
> seems quite bonkers to me 🙁
> 
> Why do we allocate two copies of the memory on all fault paths?
> 
> Why do we have fault->type still that only has one value?
> 
> What is serializing iommu_get_domain_for_dev_pasid() in the fault
> path? It looks sort of like the plan is to use iopf_param->lock and
> ensure domain removal grabs that lock at least after the xarray is
> changed - but does that actually happen?
> 
> I would suggest, broadly, a flow for iommu_report_device_fault() sort
> of:
> 
> 1) Allocate memory for the evt. Every path except errors needs this,
>     so just do it
> 2) iopf_get_dev_fault_param() should not have locks in it! This is
>     fast path now. Use a refcount, atomic compare exchange to allocate,
>     and RCU free.
> 3) Everything runs under the fault_param->lock
> 4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then
>     exit! This logic is really tortured and confusing
> 5) Allocate memory and assemble the group
> 6) Obtain the domain for this group and incr a per-domain counter that a
>     fault is pending on that domain
> 7) Put the*group*  into the WQ. Put the*group*  on a list in fault_param
>     instead of the individual faults
> 8) Don't linear search a linked list in iommu_page_response()! Pass
>     the group in that we got from the WQ that we*know*  is still
>     active. Ack that passed group.
> 
> When freeing a domain wait for the per-domain counter to go to
> zero. This ensures that the WQ is flushed out and all the outside
> domain references are gone.
> 
> When wanting to turn off PRI make sure a non-PRI domain is
> attached to everything. Fence against the HW's event queue. No new
> iommu_report_device_fault() is possible.
> 
> Lock the fault_param->lock and go through every pending group and
> respond it. Mark the group memory as invalid so iommu_page_response()
> NOP's it. Unlock, fence the HW against queued responses, and turn off
> PRI.
> 
> An*optimization*  would be to lightly flush the domain when changing
> the translation. Lock the fault_param->lock and look for groups in the
> list with old_domain.  Do the same as for PRI-off: respond to the
> group, mark it as NOP. The WQ may still be chewing on something so the
> domain free still has to check and wait.

Very appreciated for all the ideas. I looked through the items and felt
that all these are good optimizations.

I am wondering whether we can take patch 1/12 ~ 10/12 of this series as
a first step, a refactoring effort to support delivering iopf to
userspace? I will follow up with one or multiple series to add the
optimizations.

Does this work for you? Or, you want to take any of above as the
requirement for iommufd use case?

Best regards,
baolu

Tian, Kevin Dec. 4, 2023, 5:37 a.m. UTC | #6

> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Monday, December 4, 2023 9:33 AM
> 
> On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> >> Even if atomic replacement were to be implemented,
> >> it would be necessary to ensure that all translation requests,
> >> translated requests, page requests and responses for the old domain are
> >> drained before switching to the new domain.
> >
> > Again, no it isn't required.
> >
> > Requests simply have to continue to be acked, it doesn't matter if
> > they are acked against the wrong domain because the device will simply
> > re-issue them..
> 
> Ah! I start to get your point now.
> 
> Even a page fault response is postponed to a new address space, which
> possibly be another address space or hardware blocking state, the
> hardware just retries.

if blocking then the device shouldn't retry.

btw if a stale request targets an virtual address which is outside of the
valid VMA's of the new address space then visible side-effect will
be incurred in handle_mm_fault() on the new space. Is it desired?

Or if a pending response carries an error code (Invalid Request) from
the old address space is received by the device when the new address
space is already activated, the hardware will report an error even
though there might be a valid mapping in the new space.

> 
> As long as we flushes all caches (IOTLB and device TLB) during
> switching, the mappings of the old domain won't leak. So it's safe to
> keep page requests there.
> 

I don't think atomic replace is the main usage for this draining 
requirement. Instead I'm more interested in the basic popular usage: 
attach-detach-attach and not convinced that no draining is required
between iommu/device to avoid interference between activities
from old/new address space.

Jason Gunthorpe Dec. 4, 2023, 1:12 p.m. UTC | #7

On Mon, Dec 04, 2023 at 09:32:37AM +0800, Baolu Lu wrote:

> > 
> > I know that is why it does, but it doesn't explain at all why.
> > 
> > > 1. Clears the pasid translation entry. Thus, all subsequent DMA
> > >     transactions (translation requests, translated requests or page
> > >     requests) targeting the iommu domain will be blocked.
> > > 
> > > 2. Waits until all pending page requests for the device's PASID have
> > >     been reported to upper layers via the iommu_report_device_fault().
> > >     However, this does not guarantee that all page requests have been
> > >     responded.
> > > 
> > > 3. Free all partial page requests for this pasid since the page request
> > >     response is only needed for a complete request group. There's no
> > >     action required for the page requests which are not last of a request
> > >     group.
> > 
> > But we expect the last to come eventually since everything should be
> > grouped properly, so why bother doing this?
> > 
> > Indeed if 2 worked, how is this even possible to have partials?
> 
> Step 1 clears the pasid table entry, hence all subsequent page requests
> are blocked (hardware auto-respond the request but not put it in the
> queue).

OK, that part makes sense, but it should be clearly documented that is
why this stuff is going on with the partial list. 

"We have to clear the parial list as the new domain may not generate a
SW visible LAST. If it does generate a SW visible last then we simply
incompletely fault it and restart the device which will fix things on
retry"

> > Requests simply have to continue to be acked, it doesn't matter if
> > they are acked against the wrong domain because the device will simply
> > re-issue them..
> 
> Ah! I start to get your point now.
> 
> Even a page fault response is postponed to a new address space, which
> possibly be another address space or hardware blocking state, the
> hardware just retries.
> 
> As long as we flushes all caches (IOTLB and device TLB) during switching,
> the mappings of the old domain won't leak. So it's safe to keep page
> requests there.
> 
> Do I get you correctly?

Yes

It seems much simpler to me than trying to make this synchronous and
it is compatible with hitless replace of a PASID.

The lifetime and locking rules are also far more understandable

Jason

Jason Gunthorpe Dec. 4, 2023, 1:25 p.m. UTC | #8

On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > From: Baolu Lu <baolu.lu@linux.intel.com>
> > Sent: Monday, December 4, 2023 9:33 AM
> > 
> > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > >> Even if atomic replacement were to be implemented,
> > >> it would be necessary to ensure that all translation requests,
> > >> translated requests, page requests and responses for the old domain are
> > >> drained before switching to the new domain.
> > >
> > > Again, no it isn't required.
> > >
> > > Requests simply have to continue to be acked, it doesn't matter if
> > > they are acked against the wrong domain because the device will simply
> > > re-issue them..
> > 
> > Ah! I start to get your point now.
> > 
> > Even a page fault response is postponed to a new address space, which
> > possibly be another address space or hardware blocking state, the
> > hardware just retries.
> 
> if blocking then the device shouldn't retry.

It does retry.

The device is waiting on a PRI, it gets back an completion. It issues
a new ATS (this is the rety) and the new-domain responds back with a
failure indication.

If the new domain had a present page it would respond with a
translation

If the new domain has a non-present page then we get a new PRI.

The point is from a device perspective it is always doing something
correct.

> btw if a stale request targets an virtual address which is outside of the
> valid VMA's of the new address space then visible side-effect will
> be incurred in handle_mm_fault() on the new space. Is it desired?

The whole thing is racy, if someone is radically changing the
underlying mappings while DMA is ongoing then there is no way to
synchronize 'before' and 'after' against a concurrent external device.

So who cares?

What we care about is that the ATC is coherent and never has stale
data. The invalidation after changing the translation ensures this
regardless of any outstanding un-acked PRI.

> Or if a pending response carries an error code (Invalid Request) from
> the old address space is received by the device when the new address
> space is already activated, the hardware will report an error even
> though there might be a valid mapping in the new space.

Again, all racy. If a DMA is ongoing at the same instant things are
changed there is no definitive way to say if it resolved before or
after.

The only thing we care about is that dmas that are completed before
see the before translation and dmas that are started after see the
after translation.

DMAs that cross choose one at random.

> I don't think atomic replace is the main usage for this draining 
> requirement. Instead I'm more interested in the basic popular usage: 
> attach-detach-attach and not convinced that no draining is required
> between iommu/device to avoid interference between activities
> from old/new address space.

Something like IDXD needs to halt DMAs on the PASID and flush all
outstanding DMA to get to a state where the PASID is quiet from the
device perspective. This is the only way to stop interference.

If the device is still issuing DMA after the domain changes then it is
never going to work right.

If *IDXD* needs some help to flush PRIs after it halts DMAs (because
it can't do it on its own for some reason) then IDXD should have an
explicit call to do that, after suspending new DMA.

We don't know what things devices will need to do here, devices that
are able to wait for PRIs to complete may want a cancelling flush to
speed that up, and that shouldn't be part of the translation change.

IOW the act of halting DMA and the act of changing the translation
really should be different things. Then we get into interesting
questions like what sequence is required for a successful FLR. :\

Jason

Jason Gunthorpe Dec. 4, 2023, 1:27 p.m. UTC | #9

On Mon, Dec 04, 2023 at 11:46:30AM +0800, Baolu Lu wrote:
> On 12/2/23 4:35 AM, Jason Gunthorpe wrote:

> I am wondering whether we can take patch 1/12 ~ 10/12 of this series as
> a first step, a refactoring effort to support delivering iopf to
> userspace? I will follow up with one or multiple series to add the
> optimizations.

I think that is reasonable, though I would change the earlier patch to
use RCU to obtain the fault data.

Jason

Baolu Lu Dec. 5, 2023, 1:13 a.m. UTC | #10

On 12/4/23 9:27 PM, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 11:46:30AM +0800, Baolu Lu wrote:
>> On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
>> I am wondering whether we can take patch 1/12 ~ 10/12 of this series as
>> a first step, a refactoring effort to support delivering iopf to
>> userspace? I will follow up with one or multiple series to add the
>> optimizations.
> I think that is reasonable, though I would change the earlier patch to
> use RCU to obtain the fault data.

All right! I will do this in the updated version.

Best regards,
baolu

Tian, Kevin Dec. 5, 2023, 1:32 a.m. UTC | #11

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Monday, December 4, 2023 9:25 PM
> 
> On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > From: Baolu Lu <baolu.lu@linux.intel.com>
> > > Sent: Monday, December 4, 2023 9:33 AM
> > >
> > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > >> Even if atomic replacement were to be implemented,
> > > >> it would be necessary to ensure that all translation requests,
> > > >> translated requests, page requests and responses for the old domain
> are
> > > >> drained before switching to the new domain.
> > > >
> > > > Again, no it isn't required.
> > > >
> > > > Requests simply have to continue to be acked, it doesn't matter if
> > > > they are acked against the wrong domain because the device will simply
> > > > re-issue them..
> > >
> > > Ah! I start to get your point now.
> > >
> > > Even a page fault response is postponed to a new address space, which
> > > possibly be another address space or hardware blocking state, the
> > > hardware just retries.
> >
> > if blocking then the device shouldn't retry.
> 
> It does retry.
> 
> The device is waiting on a PRI, it gets back an completion. It issues
> a new ATS (this is the rety) and the new-domain responds back with a
> failure indication.

I'm not sure that is the standard behavior defined by PCIe spec.

According to "10.4.2 Page Request Group Response Message", function's
response to Page Request failure is implementation specific.

so a new ATS is optional and likely the device will instead abort the DMA
if PRI response already indicates a failure.

> 
> If the new domain had a present page it would respond with a
> translation
> 
> If the new domain has a non-present page then we get a new PRI.
> 
> The point is from a device perspective it is always doing something
> correct.
> 
> > btw if a stale request targets an virtual address which is outside of the
> > valid VMA's of the new address space then visible side-effect will
> > be incurred in handle_mm_fault() on the new space. Is it desired?
> 
> The whole thing is racy, if someone is radically changing the
> underlying mappings while DMA is ongoing then there is no way to
> synchronize 'before' and 'after' against a concurrent external device.
> 
> So who cares?
> 
> What we care about is that the ATC is coherent and never has stale
> data. The invalidation after changing the translation ensures this
> regardless of any outstanding un-acked PRI.
> 
> > Or if a pending response carries an error code (Invalid Request) from
> > the old address space is received by the device when the new address
> > space is already activated, the hardware will report an error even
> > though there might be a valid mapping in the new space.
> 
> Again, all racy. If a DMA is ongoing at the same instant things are
> changed there is no definitive way to say if it resolved before or
> after.
> 
> The only thing we care about is that dmas that are completed before
> see the before translation and dmas that are started after see the
> after translation.
> 
> DMAs that cross choose one at random.

Yes that makes sense for replacement.

But here we are talking about a draining requirement when disabling
a pasid entry, which is certainly not involved in replacement.

> 
> > I don't think atomic replace is the main usage for this draining
> > requirement. Instead I'm more interested in the basic popular usage:
> > attach-detach-attach and not convinced that no draining is required
> > between iommu/device to avoid interference between activities
> > from old/new address space.
> 
> Something like IDXD needs to halt DMAs on the PASID and flush all
> outstanding DMA to get to a state where the PASID is quiet from the
> device perspective. This is the only way to stop interference.

why is it IDXD specific behavior? I suppose all devices need to quiesce
the outstanding DMAs when tearing down the binding between the
PASID and previous address space.

and here what you described is the normal behavior. In this case
I agree that no draining is required in iommu side given the device
should have quiesced all outstanding DMAs including page requests.

but there are also terminal conditions e.g. when a workqueue is
reset after hang hence additional draining is required from the 
iommu side to ensure all the outstanding page requests/responses
are properly handled.

vt-d spec defines a draining process to cope with those terminal
conditions (see 7.9 Pending Page Request Handling on Terminal
Conditions). intel-iommu driver just implements it by default for
simplicity (one may consider providing explicit API for drivers to
call but not sure of the necessity if such terminal conditions
apply to most devices). anyway this is not a fast path.

another example might be stop marker. A device using stop marker
doesn't need to wait for outstanding page requests. According to PCIe
spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device
simply marks outstanding page request as stale and sends a stop
marker message to the IOMMU. Page responses for those stale
requests are ignored. But presumably the iommu driver still needs
to drain those requests until the stop marker message in unbind
to avoid them incorrectly routed to a new address space in case the
PASID is rebound to another process immediately.

> 
> If the device is still issuing DMA after the domain changes then it is
> never going to work right.
> 
> If *IDXD* needs some help to flush PRIs after it halts DMAs (because
> it can't do it on its own for some reason) then IDXD should have an
> explicit call to do that, after suspending new DMA.

as above I don't think IDXD itself has any special requirements. We
are discussing general device terminal conditions which are considered
by the iommu spec.

> 
> We don't know what things devices will need to do here, devices that
> are able to wait for PRIs to complete may want a cancelling flush to
> speed that up, and that shouldn't be part of the translation change.
> 
> IOW the act of halting DMA and the act of changing the translation
> really should be different things. Then we get into interesting
> questions like what sequence is required for a successful FLR. :\
> 
> Jason

Jason Gunthorpe Dec. 5, 2023, 1:53 a.m. UTC | #12

On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Monday, December 4, 2023 9:25 PM
> > 
> > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > > From: Baolu Lu <baolu.lu@linux.intel.com>
> > > > Sent: Monday, December 4, 2023 9:33 AM
> > > >
> > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > > >> Even if atomic replacement were to be implemented,
> > > > >> it would be necessary to ensure that all translation requests,
> > > > >> translated requests, page requests and responses for the old domain
> > are
> > > > >> drained before switching to the new domain.
> > > > >
> > > > > Again, no it isn't required.
> > > > >
> > > > > Requests simply have to continue to be acked, it doesn't matter if
> > > > > they are acked against the wrong domain because the device will simply
> > > > > re-issue them..
> > > >
> > > > Ah! I start to get your point now.
> > > >
> > > > Even a page fault response is postponed to a new address space, which
> > > > possibly be another address space or hardware blocking state, the
> > > > hardware just retries.
> > >
> > > if blocking then the device shouldn't retry.
> > 
> > It does retry.
> > 
> > The device is waiting on a PRI, it gets back an completion. It issues
> > a new ATS (this is the rety) and the new-domain responds back with a
> > failure indication.
> 
> I'm not sure that is the standard behavior defined by PCIe spec.

> According to "10.4.2 Page Request Group Response Message", function's
> response to Page Request failure is implementation specific.
> 
> so a new ATS is optional and likely the device will instead abort the DMA
> if PRI response already indicates a failure.

I didn't said the PRI would fail, I said the ATS would fail with a
non-present.

It has to work this way or it is completely broken with respect to
existing races in the mm side. Agents must retry non-present ATS
answers until you get a present or a ATS failure.

> > Again, all racy. If a DMA is ongoing at the same instant things are
> > changed there is no definitive way to say if it resolved before or
> > after.
> > 
> > The only thing we care about is that dmas that are completed before
> > see the before translation and dmas that are started after see the
> > after translation.
> > 
> > DMAs that cross choose one at random.
> 
> Yes that makes sense for replacement.
> 
> But here we are talking about a draining requirement when disabling
> a pasid entry, which is certainly not involved in replacement.

It is the same argument, you are replacing a PTE that was non-present
with one that is failing/blocking - the result of a DMA that crosses
this event can be either.

> > > I don't think atomic replace is the main usage for this draining
> > > requirement. Instead I'm more interested in the basic popular usage:
> > > attach-detach-attach and not convinced that no draining is required
> > > between iommu/device to avoid interference between activities
> > > from old/new address space.
> > 
> > Something like IDXD needs to halt DMAs on the PASID and flush all
> > outstanding DMA to get to a state where the PASID is quiet from the
> > device perspective. This is the only way to stop interference.
> 
> why is it IDXD specific behavior? I suppose all devices need to quiesce
> the outstanding DMAs when tearing down the binding between the
> PASID and previous address space.

Because it is so simple HW I assume this is why this code is being
pushed here :)

> but there are also terminal conditions e.g. when a workqueue is
> reset after hang hence additional draining is required from the 
> iommu side to ensure all the outstanding page requests/responses
> are properly handled.

Then it should be coded as an explicit drain request from device when
and where they need it.

It should not be integrated into the iommu side because it is
nonsensical. Devices expecting consistent behavior must stop DMA
before changing translation, and if they need help to do it they must
call APIs. Changing translation is not required after a so called
"terminal event".

> vt-d spec defines a draining process to cope with those terminal
> conditions (see 7.9 Pending Page Request Handling on Terminal
> Conditions). intel-iommu driver just implements it by default for
> simplicity (one may consider providing explicit API for drivers to
> call but not sure of the necessity if such terminal conditions
> apply to most devices). anyway this is not a fast path.

It is not "by default" it is in the wrong place. These terminal
conditions are things like FLR. FLR has nothing to do with changing
the translation. I can trigger FLR and keep the current translation
and still would want to flush out all the PRIs before starting DMA
again to avoid protocol confusion.

An API is absolutely necessary. Confusing the cases that need draining
with translation change is just not logically right.

eg we do need to modify VFIO to do the drain on FLR like the spec
explains!

Draining has to be ordered correctly with whatever the device is
doing. Drain needs to come after FLR, for instance. It needs to come
after a work queue reset, because drain doesn't make any sense unless
it is coupled with a DMA stop at the device.

Hacking a DMA stop by forcing a blocking translation is not logically
correct, with wrong ordering the device may see unexpected translation
failures which may trigger AERs or bad things..

> another example might be stop marker. A device using stop marker
> doesn't need to wait for outstanding page requests. According to PCIe
> spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device
> simply marks outstanding page request as stale and sends a stop
> marker message to the IOMMU. Page responses for those stale
> requests are ignored. But presumably the iommu driver still needs
> to drain those requests until the stop marker message in unbind
> to avoid them incorrectly routed to a new address space in case the
> PASID is rebound to another process immediately.

Stop marker doesn't change anything, in all processing it just removes
requests that have yet to complete. If a device is using stop then
most likely the whole thing is racy and the OS simply has to be ready
to handle stop at any time.

Jason

Tian, Kevin Dec. 5, 2023, 3:23 a.m. UTC | #13

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, December 5, 2023 9:53 AM
> 
> On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Monday, December 4, 2023 9:25 PM
> > >
> > > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > > > From: Baolu Lu <baolu.lu@linux.intel.com>
> > > > > Sent: Monday, December 4, 2023 9:33 AM
> > > > >
> > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > > > >> Even if atomic replacement were to be implemented,
> > > > > >> it would be necessary to ensure that all translation requests,
> > > > > >> translated requests, page requests and responses for the old
> domain
> > > are
> > > > > >> drained before switching to the new domain.
> > > > > >
> > > > > > Again, no it isn't required.
> > > > > >
> > > > > > Requests simply have to continue to be acked, it doesn't matter if
> > > > > > they are acked against the wrong domain because the device will
> simply
> > > > > > re-issue them..
> > > > >
> > > > > Ah! I start to get your point now.
> > > > >
> > > > > Even a page fault response is postponed to a new address space,
> which
> > > > > possibly be another address space or hardware blocking state, the
> > > > > hardware just retries.
> > > >
> > > > if blocking then the device shouldn't retry.
> > >
> > > It does retry.
> > >
> > > The device is waiting on a PRI, it gets back an completion. It issues
> > > a new ATS (this is the rety) and the new-domain responds back with a
> > > failure indication.
> >
> > I'm not sure that is the standard behavior defined by PCIe spec.
> 
> > According to "10.4.2 Page Request Group Response Message", function's
> > response to Page Request failure is implementation specific.
> >
> > so a new ATS is optional and likely the device will instead abort the DMA
> > if PRI response already indicates a failure.
> 
> I didn't said the PRI would fail, I said the ATS would fail with a
> non-present.
> 
> It has to work this way or it is completely broken with respect to
> existing races in the mm side. Agents must retry non-present ATS
> answers until you get a present or a ATS failure.

My understanding of the sequence is like below:

<'D' for device, 'I' for IOMMU>

  (D) send a ATS translation request
  (I) respond translation result
  (D) If success then sends DMA to the target page
      otherwise send a PRI request
        (I) raise an IOMMU interrupt allowing sw to fix the translation
        (I) generate a PRI response to device
        (D) if success then jump to the first step to retry
            otherwise abort the current request

If mm changes the mapping after a success PRI response, mmu notifier
callback in iommu driver needs to wait for device tlb invalidation completion
which the device will order properly with outstanding DMA requests using
the old translation.

If you refer to the 'retry' after receiving a success PRI response, then yes.

but there is really no reason to retry upon a PRI response failure which
indicates that the faulting address is not a valid one which OS would like
to fix.

> 
> > > Again, all racy. If a DMA is ongoing at the same instant things are
> > > changed there is no definitive way to say if it resolved before or
> > > after.
> > >
> > > The only thing we care about is that dmas that are completed before
> > > see the before translation and dmas that are started after see the
> > > after translation.
> > >
> > > DMAs that cross choose one at random.
> >
> > Yes that makes sense for replacement.
> >
> > But here we are talking about a draining requirement when disabling
> > a pasid entry, which is certainly not involved in replacement.
> 
> It is the same argument, you are replacing a PTE that was non-present

s/non-present/present/

> with one that is failing/blocking - the result of a DMA that crosses
> this event can be either.

kind of

> 
> > > > I don't think atomic replace is the main usage for this draining
> > > > requirement. Instead I'm more interested in the basic popular usage:
> > > > attach-detach-attach and not convinced that no draining is required
> > > > between iommu/device to avoid interference between activities
> > > > from old/new address space.
> > >
> > > Something like IDXD needs to halt DMAs on the PASID and flush all
> > > outstanding DMA to get to a state where the PASID is quiet from the
> > > device perspective. This is the only way to stop interference.
> >
> > why is it IDXD specific behavior? I suppose all devices need to quiesce
> > the outstanding DMAs when tearing down the binding between the
> > PASID and previous address space.
> 
> Because it is so simple HW I assume this is why this code is being
> pushed here :)
> 
> > but there are also terminal conditions e.g. when a workqueue is
> > reset after hang hence additional draining is required from the
> > iommu side to ensure all the outstanding page requests/responses
> > are properly handled.
> 
> Then it should be coded as an explicit drain request from device when
> and where they need it.
> 
> It should not be integrated into the iommu side because it is
> nonsensical. Devices expecting consistent behavior must stop DMA
> before changing translation, and if they need help to do it they must
> call APIs. Changing translation is not required after a so called
> "terminal event".
> 
> > vt-d spec defines a draining process to cope with those terminal
> > conditions (see 7.9 Pending Page Request Handling on Terminal
> > Conditions). intel-iommu driver just implements it by default for
> > simplicity (one may consider providing explicit API for drivers to
> > call but not sure of the necessity if such terminal conditions
> > apply to most devices). anyway this is not a fast path.
> 
> It is not "by default" it is in the wrong place. These terminal
> conditions are things like FLR. FLR has nothing to do with changing
> the translation. I can trigger FLR and keep the current translation
> and still would want to flush out all the PRIs before starting DMA
> again to avoid protocol confusion.
> 
> An API is absolutely necessary. Confusing the cases that need draining
> with translation change is just not logically right.
> 
> eg we do need to modify VFIO to do the drain on FLR like the spec
> explains!
> 
> Draining has to be ordered correctly with whatever the device is
> doing. Drain needs to come after FLR, for instance. It needs to come
> after a work queue reset, because drain doesn't make any sense unless
> it is coupled with a DMA stop at the device.

Okay that makes sense. As Baolu and you already agreed let's separate
this fix out of this series.

The minor interesting aspect is how to document this requirement
clearly so drivers won't skip calling it when sva is enabled. 

> 
> Hacking a DMA stop by forcing a blocking translation is not logically
> correct, with wrong ordering the device may see unexpected translation
> failures which may trigger AERs or bad things..

where is such hack? though the current implementation of draining
is not clean, it's put inside pasid-disable-sequence instead of forcing
a blocking translation implicitly in iommu driver i.e. it's still the driver
making decision for what translation to be used...

> 
> > another example might be stop marker. A device using stop marker
> > doesn't need to wait for outstanding page requests. According to PCIe
> > spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device
> > simply marks outstanding page request as stale and sends a stop
> > marker message to the IOMMU. Page responses for those stale
> > requests are ignored. But presumably the iommu driver still needs
> > to drain those requests until the stop marker message in unbind
> > to avoid them incorrectly routed to a new address space in case the
> > PASID is rebound to another process immediately.
> 
> Stop marker doesn't change anything, in all processing it just removes
> requests that have yet to complete. If a device is using stop then
> most likely the whole thing is racy and the OS simply has to be ready
> to handle stop at any time.
> 

I'm not sure whether a device supporting stop marker may provide
certain abort-dma cmd to not wait. But guess such abort semantics
will be likely used in terminal condition too so the same argument
still hold. Presumably normal translation change should always use
a stop-wait semantics for outstanding DMAs. 😊

Jason Gunthorpe Dec. 5, 2023, 3:52 p.m. UTC | #14

On Tue, Dec 05, 2023 at 03:23:05AM +0000, Tian, Kevin wrote:

> > I didn't said the PRI would fail, I said the ATS would fail with a
> > non-present.
> > 
> > It has to work this way or it is completely broken with respect to
> > existing races in the mm side. Agents must retry non-present ATS
> > answers until you get a present or a ATS failure.
> 
> My understanding of the sequence is like below:
> 
> <'D' for device, 'I' for IOMMU>
> 
>   (D) send a ATS translation request
>   (I) respond translation result
>   (D) If success then sends DMA to the target page
>       otherwise send a PRI request
>         (I) raise an IOMMU interrupt allowing sw to fix the translation
>         (I) generate a PRI response to device
>         (D) if success then jump to the first step to retry
>             otherwise abort the current request
> 
> If mm changes the mapping after a success PRI response, mmu notifier
> callback in iommu driver needs to wait for device tlb invalidation completion
> which the device will order properly with outstanding DMA requests using
> the old translation.
> 
> If you refer to the 'retry' after receiving a success PRI response, then yes.
> 
> but there is really no reason to retry upon a PRI response failure which
> indicates that the faulting address is not a valid one which OS would like
> to fix.

Right

> > Draining has to be ordered correctly with whatever the device is
> > doing. Drain needs to come after FLR, for instance. It needs to come
> > after a work queue reset, because drain doesn't make any sense unless
> > it is coupled with a DMA stop at the device.
> 
> Okay that makes sense. As Baolu and you already agreed let's separate
> this fix out of this series.
> 
> The minor interesting aspect is how to document this requirement
> clearly so drivers won't skip calling it when sva is enabled. 

All changes to translation inside kernel drivers should only be done
once the DMA is halted, otherwise things possibily become security
troubled.

We should document this clearly, it is already expected in common
cases like using the DMA API and when removing() drivers. It also
applies when the driver is manually changing a PASID.

The issue is not drain, it is that the HW is still doing DMA on the
PASID and the PASID may be assigned to a new process. This kernel
*must* prevent this directly and strongly.

If the device requires a drain to halt its DMA, then that is a device
specific sequence. Otherwise it should simply halt its DMA in whatever
device specific way it has.

> > Hacking a DMA stop by forcing a blocking translation is not logically
> > correct, with wrong ordering the device may see unexpected translation
> > failures which may trigger AERs or bad things..
> 
> where is such hack? though the current implementation of draining
> is not clean, it's put inside pasid-disable-sequence instead of forcing
> a blocking translation implicitly in iommu driver i.e. it's still the driver
> making decision for what translation to be used...

It is mis-understanding the purpose of drain.

In normal operating cases PRI just flows and the device will
eventually, naturally, reach a stable terminal case. We don't provide
any ordering guarentees across translation changes so PRI just follows
that design. If you change the translation with ongoing DMA then you
just don't know what order things will happen in.

The main purpose of drain is to keep the PRI protocol itself in sync
against events on the device side that cause it to forget about the
tags it has already issued. Eg a FLR should reset the tag record. If a
device then issues a new PRI with a tag that matches a tag that was
outstanding prior to FLR we can get a corruption.

So any drain sequence should start with the device halting new
PRIs. We flush all PRI tags from the system completely, and then the
device may resume issuing new PRIs.

When the device resets it's tag labels is a property of the device.

Notice none of this has anything to do with change of
translation. Change of translation, or flush of ATC, does not
invalidate the tags.

A secondary case is to help devices halt their DMA when they cannot do
this on their own.

Jason

diff mbox series

Patch

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c17d5979d70d..cd3cdeb69f49 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1431,7 +1431,7 @@  iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm)
 #ifdef CONFIG_IOMMU_IOPF
 int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev);
 int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev);
-int iopf_queue_flush_dev(struct device *dev);
+int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid);
 struct iopf_queue *iopf_queue_alloc(const char *name);
 void iopf_queue_free(struct iopf_queue *queue);
 int iopf_queue_discard_partial(struct iopf_queue *queue);
@@ -1453,7 +1453,7 @@  iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
 	return -ENODEV;
 }
 
-static inline int iopf_queue_flush_dev(struct device *dev)
+static inline int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
 {
 	return -ENODEV;
 }
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 780c5bd73ec2..659de9c16024 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -495,7 +495,7 @@  void intel_drain_pasid_prq(struct device *dev, u32 pasid)
 		goto prq_retry;
 	}
 
-	iopf_queue_flush_dev(dev);
+	iopf_queue_discard_dev_pasid(dev, pasid);
 
 	/*
 	 * Perform steps described in VT-d spec CH7.10 to drain page
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index b80574323cbc..b288c73f2b22 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -260,10 +260,9 @@  int iommu_page_response(struct device *dev,
 
 	/* Only send response if there is a fault report pending */
 	mutex_lock(&fault_param->lock);
-	if (list_empty(&fault_param->faults)) {
-		dev_warn_ratelimited(dev, "no pending PRQ, drop response\n");
+	if (list_empty(&fault_param->faults))
 		goto done_unlock;
-	}
+
 	/*
 	 * Check if we have a matching page request pending to respond,
 	 * otherwise return -EINVAL
@@ -304,30 +303,67 @@  int iommu_page_response(struct device *dev,
 EXPORT_SYMBOL_GPL(iommu_page_response);
 
 /**
- * iopf_queue_flush_dev - Ensure that all queued faults have been processed
- * @dev: the endpoint whose faults need to be flushed.
+ * iopf_queue_discard_dev_pasid - Discard all pending faults for a PASID
+ * @dev: the endpoint whose faults need to be discarded.
+ * @pasid: the PASID of the endpoint.
  *
  * The IOMMU driver calls this before releasing a PASID, to ensure that all
- * pending faults for this PASID have been handled, and won't hit the address
- * space of the next process that uses this PASID. The driver must make sure
- * that no new fault is added to the queue. In particular it must flush its
- * low-level queue before calling this function.
+ * pending faults for this PASID have been handled or dropped, and won't hit
+ * the address space of the next process that uses this PASID. The driver
+ * must make sure that no new fault is added to the queue. In particular it
+ * must flush its low-level queue before calling this function.
  *
  * Return: 0 on success and <0 on error.
  */
-int iopf_queue_flush_dev(struct device *dev)
+int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
 {
 	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
+	const struct iommu_ops *ops = dev_iommu_ops(dev);
+	struct iommu_page_response resp;
+	struct iopf_fault *iopf, *next;
+	int ret = 0;
 
 	if (!iopf_param)
 		return -ENODEV;
 
 	flush_workqueue(iopf_param->queue->wq);
+
+	mutex_lock(&iopf_param->lock);
+	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
+		if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ||
+		    iopf->fault.prm.pasid != pasid)
+			break;
+
+		list_del(&iopf->list);
+		kfree(iopf);
+	}
+
+	list_for_each_entry_safe(iopf, next, &iopf_param->faults, list) {
+		if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) ||
+		    iopf->fault.prm.pasid != pasid)
+			continue;
+
+		memset(&resp, 0, sizeof(struct iommu_page_response));
+		resp.pasid = iopf->fault.prm.pasid;
+		resp.grpid = iopf->fault.prm.grpid;
+		resp.code = IOMMU_PAGE_RESP_INVALID;
+
+		if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID)
+			resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
+
+		ret = ops->page_response(dev, iopf, &resp);
+		if (ret)
+			break;
+
+		list_del(&iopf->list);
+		kfree(iopf);
+	}
+	mutex_unlock(&iopf_param->lock);
 	iopf_put_dev_fault_param(iopf_param);
 
-	return 0;
+	return ret;
 }
-EXPORT_SYMBOL_GPL(iopf_queue_flush_dev);
+EXPORT_SYMBOL_GPL(iopf_queue_discard_dev_pasid);
 
 /**
  * iopf_group_response - Respond a group of page faults