Message ID | 20231115030226.16700-13-baolu.lu@linux.intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:6358:a59:b0:164:83eb:24d7 with SMTP id 25csp2357396rwb; Tue, 14 Nov 2023 19:09:00 -0800 (PST) X-Google-Smtp-Source: AGHT+IEg5V/ZPqH54/K00EPQM4y/saAXXCoMIfuBhsIvbbLE5HBGD9GfaC7ERe3ZRjvpcWxt0GZh X-Received: by 2002:a17:903:2101:b0:1cc:f60:28b2 with SMTP id o1-20020a170903210100b001cc0f6028b2mr3983601ple.6.1700017740666; Tue, 14 Nov 2023 19:09:00 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700017740; cv=none; d=google.com; s=arc-20160816; b=TGYBhVQFSC8IdkeckejQ928iZ3qUUE6br95E9ohV9h0Db61oxjd7dmHD9F6qma90Ti 4J6r9gKcGCr2/epJtqNWkcoqG8CZ1EI29MgQ+sG1uTOOrMArtUudWWJHDDt6FyB6QYoQ f9Tr7WdLkvx/gAU7Yo0ebNC9RnKQQH48EUTO2fMIJnVQ/T8dI1c0RTj4nDxkf7iY65dN JxUpa/lB09/udIBInl1tZ8cr+fwdvKvG4gdv+z9fiRbQcLzi9GA93xUHWpZvijNfuf5r AcBDwkriqkAgE6wrHPbG9NrqcM5Rs+a6f9JbXPmBz9hg1CaB332P5JRPziyWCbGxFMgP fygw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=l0HK+O/ocVpxc2Colgt2NGo5vDdtC00NNi++ixzwNLY=; fh=DLFwkwrOYM/4/CX+ox6OE1AaCACujz37TA++iS2u/6Y=; b=Zx7H6as1OOzcESnWwoahUwPnef7hw/fY/RA4QYehU6JQBonIks5NcRPNJFUy9oFC16 iXCMlFb7XsehMPgv9149cD2d6mewHFZEuoyARc/QRjjBJ+qFpkOJGSYYPmNNNvut05/S lHl5aHokcrZ2jcnkcl6R3srEsfSEbkripdxzPLJFVKsuVIXVnSKoXpf2DRi0JtfqiS0K YP8Trrmug6uYZCzApfsJJnamZA7iI31YjTS+eW9sGEgUePqo+4082+JB9NEaGJlcV8OS j+MJqIuCe3gDzr3OTzYIKAo7zZNGFfNohiFjob6lfNtVLrwdz+dNzUDM/2I0SqgyRDzl J8Gw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=gy+3Bn9b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from lipwig.vger.email (lipwig.vger.email. [23.128.96.33]) by mx.google.com with ESMTPS id ja7-20020a170902efc700b001cc23d2bb92si8554700plb.650.2023.11.14.19.09.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 14 Nov 2023 19:09:00 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=gy+3Bn9b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id 3F36080CBF0D; Tue, 14 Nov 2023 19:08:58 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234454AbjKODIN (ORCPT <rfc822;heyuhang3455@gmail.com> + 28 others); Tue, 14 Nov 2023 22:08:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39692 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234379AbjKODHy (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 14 Nov 2023 22:07:54 -0500 Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 19F821720; Tue, 14 Nov 2023 19:07:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1700017652; x=1731553652; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=B+Yg/spqjl3lbOQyq1c1j/kiy1AnFo0msZxv07NYnZU=; b=gy+3Bn9b73/6lrmOvQoe647uchFbKH079x4L9ipEuNnaZMPVIyULdLgI M2gGhy5cz6fxxJzkjfY8wwJqi2l4S5rglaJZgY1Le1aS4MO+ijlBAWXHS 77fnXnP6rgnUu/7ZZMeNtqGDER+0ZZLTFrXlt1cyFsrM4wdA+o1ZDM/Mr 00BEIzxF78/2BgG9oP+Qu0mUJPd9XrH6bAzP949JHjlrqsBfSRWSowBk5 MLVjIex2XZBXKf5O6RML3dvmDK3uY3Hkcw8dphfvtIyqr70zJhmOKQfJh JTbcPO5ANTQl8lhEUzjXgESSV+Qins1clUOjmaHZou9kB9su1ZRJYTlgj g==; X-IronPort-AV: E=McAfee;i="6600,9927,10894"; a="394715518" X-IronPort-AV: E=Sophos;i="6.03,303,1694761200"; d="scan'208";a="394715518" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Nov 2023 19:07:25 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10894"; a="1012128954" X-IronPort-AV: E=Sophos;i="6.03,303,1694761200"; d="scan'208";a="1012128954" Received: from allen-box.sh.intel.com ([10.239.159.127]) by fmsmga006.fm.intel.com with ESMTP; 14 Nov 2023 19:07:22 -0800 From: Lu Baolu <baolu.lu@linux.intel.com> To: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, Jason Gunthorpe <jgg@ziepe.ca>, Kevin Tian <kevin.tian@intel.com>, Jean-Philippe Brucker <jean-philippe@linaro.org>, Nicolin Chen <nicolinc@nvidia.com> Cc: Yi Liu <yi.l.liu@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>, Yan Zhao <yan.y.zhao@intel.com>, iommu@lists.linux.dev, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Lu Baolu <baolu.lu@linux.intel.com> Subject: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev() Date: Wed, 15 Nov 2023 11:02:26 +0800 Message-Id: <20231115030226.16700-13-baolu.lu@linux.intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231115030226.16700-1-baolu.lu@linux.intel.com> References: <20231115030226.16700-1-baolu.lu@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Tue, 14 Nov 2023 19:08:58 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1782597802055527780 X-GMAIL-MSGID: 1782597802055527780 |
Series |
iommu: Prepare to deliver page faults to user space
|
|
Commit Message
Baolu Lu
Nov. 15, 2023, 3:02 a.m. UTC
The iopf_queue_flush_dev() is called by the iommu driver before releasing a PASID. It ensures that all pending faults for this PASID have been handled or cancelled, and won't hit the address space that reuses this PASID. The driver must make sure that no new fault is added to the queue. The SMMUv3 driver doesn't use it because it only implements the Arm-specific stall fault model where DMA transactions are held in the SMMU while waiting for the OS to handle iopf's. Since a device driver must complete all DMA transactions before detaching domain, there are no pending iopf's with the stall model. PRI support requires adding a call to iopf_queue_flush_dev() after flushing the hardware page fault queue. The current implementation of iopf_queue_flush_dev() is a simplified version. It is only suitable for SVA case in which the processing of iopf is implemented in the inner loop of the iommu subsystem. Improve this interface to make it also work for handling iopf out of the iommu core. Rename the function with a more meaningful name. Remove a warning message in iommu_page_response() since the iopf queue might get flushed before possible pending responses. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> --- include/linux/iommu.h | 4 +-- drivers/iommu/intel/svm.c | 2 +- drivers/iommu/io-pgfault.c | 60 ++++++++++++++++++++++++++++++-------- 3 files changed, 51 insertions(+), 15 deletions(-)
Comments
On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote: > The iopf_queue_flush_dev() is called by the iommu driver before releasing > a PASID. It ensures that all pending faults for this PASID have been > handled or cancelled, and won't hit the address space that reuses this > PASID. The driver must make sure that no new fault is added to the queue. This needs more explanation, why should anyone care? More importantly, why is *discarding* the right thing to do? Especially why would we discard a partial page request group? After we change a translation we may have PRI requests in a queue. They need to be acknowledged, not discarded. The DMA in the device should be restarted and the device should observe the new translation - if it is blocking then it should take a DMA error. More broadly, we should just let things run their normal course. The domain to deliver the fault to should be determined very early. If we get a fault and there is no fault domain currently assigned then just restart it. The main reason to fence would be to allow the domain to become freed as the faults should be holding pointers to it. But I feel there are simpler options for that then this.. > The SMMUv3 driver doesn't use it because it only implements the > Arm-specific stall fault model where DMA transactions are held in the SMMU > while waiting for the OS to handle iopf's. Since a device driver must > complete all DMA transactions before detaching domain, there are no > pending iopf's with the stall model. PRI support requires adding a call to > iopf_queue_flush_dev() after flushing the hardware page fault queue. This explanation doesn't make much sense, from a device driver perspective both PRI and stall cause the device to not complete DMAs. The difference between stall and PRI is fairly small, stall causes an internal bus to lock up while PRI does not. > -int iopf_queue_flush_dev(struct device *dev) > +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid) > { > struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev); > + const struct iommu_ops *ops = dev_iommu_ops(dev); > + struct iommu_page_response resp; > + struct iopf_fault *iopf, *next; > + int ret = 0; > > if (!iopf_param) > return -ENODEV; > > flush_workqueue(iopf_param->queue->wq); > + A naked flush_workqueue like this is really suspicious, it needs a comment explaining why the queue can't get more work queued at this point. I suppose the driver is expected to stop calling iommu_report_device_fault() before calling this function, but that doesn't seem like it is going to be possible. Drivers should be implementing atomic replace for the PASID updates and in that case there is no momement when it can say the HW will stop generating PRI. I'm looking at this code after these patches are applied and it still seems quite bonkers to me :( Why do we allocate two copies of the memory on all fault paths? Why do we have fault->type still that only has one value? What is serializing iommu_get_domain_for_dev_pasid() in the fault path? It looks sort of like the plan is to use iopf_param->lock and ensure domain removal grabs that lock at least after the xarray is changed - but does that actually happen? I would suggest, broadly, a flow for iommu_report_device_fault() sort of: 1) Allocate memory for the evt. Every path except errors needs this, so just do it 2) iopf_get_dev_fault_param() should not have locks in it! This is fast path now. Use a refcount, atomic compare exchange to allocate, and RCU free. 3) Everything runs under the fault_param->lock 4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then exit! This logic is really tortured and confusing 5) Allocate memory and assemble the group 6) Obtain the domain for this group and incr a per-domain counter that a fault is pending on that domain 7) Put the *group* into the WQ. Put the *group* on a list in fault_param instead of the individual faults 8) Don't linear search a linked list in iommu_page_response()! Pass the group in that we got from the WQ that we *know* is still active. Ack that passed group. When freeing a domain wait for the per-domain counter to go to zero. This ensures that the WQ is flushed out and all the outside domain references are gone. When wanting to turn off PRI make sure a non-PRI domain is attached to everything. Fence against the HW's event queue. No new iommu_report_device_fault() is possible. Lock the fault_param->lock and go through every pending group and respond it. Mark the group memory as invalid so iommu_page_response() NOP's it. Unlock, fence the HW against queued responses, and turn off PRI. An *optimization* would be to lightly flush the domain when changing the translation. Lock the fault_param->lock and look for groups in the list with old_domain. Do the same as for PRI-off: respond to the group, mark it as NOP. The WQ may still be chewing on something so the domain free still has to check and wait. Did I get it right?? Jason
On 12/2/23 4:35 AM, Jason Gunthorpe wrote: > On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote: >> The iopf_queue_flush_dev() is called by the iommu driver before releasing >> a PASID. It ensures that all pending faults for this PASID have been >> handled or cancelled, and won't hit the address space that reuses this >> PASID. The driver must make sure that no new fault is added to the queue. > This needs more explanation, why should anyone care? > > More importantly, why is*discarding* the right thing to do? > Especially why would we discard a partial page request group? > > After we change a translation we may have PRI requests in a > queue. They need to be acknowledged, not discarded. The DMA in the > device should be restarted and the device should observe the new > translation - if it is blocking then it should take a DMA error. > > More broadly, we should just let things run their normal course. The > domain to deliver the fault to should be determined very early. If we > get a fault and there is no fault domain currently assigned then just > restart it. > > The main reason to fence would be to allow the domain to become freed > as the faults should be holding pointers to it. But I feel there are > simpler options for that then this.. In the iommu_detach_device_pasid() path, the domain is about to be removed from the pasid of device. The IOMMU driver performs the following steps sequentially: 1. Clears the pasid translation entry. Thus, all subsequent DMA transactions (translation requests, translated requests or page requests) targeting the iommu domain will be blocked. 2. Waits until all pending page requests for the device's PASID have been reported to upper layers via the iommu_report_device_fault(). However, this does not guarantee that all page requests have been responded. 3. Free all partial page requests for this pasid since the page request response is only needed for a complete request group. There's no action required for the page requests which are not last of a request group. 4. Iterate through the list of pending page requests and identifies those originating from the device's PASID. For each identified request, the driver responds to the hardware with the IOMMU_PAGE_RESP_INVALID code, indicating that the request cannot be handled and retries should not be attempted. This response code corresponds to the "Invalid Request" status defined in the PCI PRI specification. 5. Follow the IOMMU hardware requirements (for example, VT-d sepc, section 7.10, Software Steps to Drain Page Requests & Responses) to drain in-flight page requests and page group responses between the remapping hardware queues and the endpoint device. With above steps done in iommu_detach_device_pasid(), the pasid could be re-used for any other address space. The iopf_queue_discard_dev_pasid() helper does step 3 and 4. > >> The SMMUv3 driver doesn't use it because it only implements the >> Arm-specific stall fault model where DMA transactions are held in the SMMU >> while waiting for the OS to handle iopf's. Since a device driver must >> complete all DMA transactions before detaching domain, there are no >> pending iopf's with the stall model. PRI support requires adding a call to >> iopf_queue_flush_dev() after flushing the hardware page fault queue. > This explanation doesn't make much sense, from a device driver > perspective both PRI and stall cause the device to not complete DMAs. > > The difference between stall and PRI is fairly small, stall causes an > internal bus to lock up while PRI does not. > >> -int iopf_queue_flush_dev(struct device *dev) >> +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid) >> { >> struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev); >> + const struct iommu_ops *ops = dev_iommu_ops(dev); >> + struct iommu_page_response resp; >> + struct iopf_fault *iopf, *next; >> + int ret = 0; >> >> if (!iopf_param) >> return -ENODEV; >> >> flush_workqueue(iopf_param->queue->wq); >> + > A naked flush_workqueue like this is really suspicious, it needs a > comment explaining why the queue can't get more work queued at this > point. > > I suppose the driver is expected to stop calling > iommu_report_device_fault() before calling this function, but that > doesn't seem like it is going to be possible. Drivers should be > implementing atomic replace for the PASID updates and in that case > there is no momement when it can say the HW will stop generating PRI. Atomic domain replacement for a PASID is not currently implemented in the core or driver. Even if atomic replacement were to be implemented, it would be necessary to ensure that all translation requests, translated requests, page requests and responses for the old domain are drained before switching to the new domain. I am not sure whether the existing iommu hardware architecture supports this functionality. Best regards, baolu
On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > On 12/2/23 4:35 AM, Jason Gunthorpe wrote: > > On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote: > > > The iopf_queue_flush_dev() is called by the iommu driver before releasing > > > a PASID. It ensures that all pending faults for this PASID have been > > > handled or cancelled, and won't hit the address space that reuses this > > > PASID. The driver must make sure that no new fault is added to the queue. > > This needs more explanation, why should anyone care? > > > > More importantly, why is*discarding* the right thing to do? > > Especially why would we discard a partial page request group? > > > > After we change a translation we may have PRI requests in a > > queue. They need to be acknowledged, not discarded. The DMA in the > > device should be restarted and the device should observe the new > > translation - if it is blocking then it should take a DMA error. > > > > More broadly, we should just let things run their normal course. The > > domain to deliver the fault to should be determined very early. If we > > get a fault and there is no fault domain currently assigned then just > > restart it. > > > > The main reason to fence would be to allow the domain to become freed > > as the faults should be holding pointers to it. But I feel there are > > simpler options for that then this.. > > In the iommu_detach_device_pasid() path, the domain is about to be > removed from the pasid of device. The IOMMU driver performs the > following steps sequentially: I know that is why it does, but it doesn't explain at all why. > 1. Clears the pasid translation entry. Thus, all subsequent DMA > transactions (translation requests, translated requests or page > requests) targeting the iommu domain will be blocked. > > 2. Waits until all pending page requests for the device's PASID have > been reported to upper layers via the iommu_report_device_fault(). > However, this does not guarantee that all page requests have been > responded. > > 3. Free all partial page requests for this pasid since the page request > response is only needed for a complete request group. There's no > action required for the page requests which are not last of a request > group. But we expect the last to come eventually since everything should be grouped properly, so why bother doing this? Indeed if 2 worked, how is this even possible to have partials? > 5. Follow the IOMMU hardware requirements (for example, VT-d sepc, > section 7.10, Software Steps to Drain Page Requests & Responses) to > drain in-flight page requests and page group responses between the > remapping hardware queues and the endpoint device. > > With above steps done in iommu_detach_device_pasid(), the pasid could be > re-used for any other address space. As I said, that isn't even required. There is no issue with leaking PRI's across attachments. > > I suppose the driver is expected to stop calling > > iommu_report_device_fault() before calling this function, but that > > doesn't seem like it is going to be possible. Drivers should be > > implementing atomic replace for the PASID updates and in that case > > there is no momement when it can say the HW will stop generating PRI. > > Atomic domain replacement for a PASID is not currently implemented in > the core or driver. It is, the driver should implement set_dev_pasid in such a way that repeated calls do replacements, ideally atomically. This is what ARM SMMUv3 does after my changes. > Even if atomic replacement were to be implemented, > it would be necessary to ensure that all translation requests, > translated requests, page requests and responses for the old domain are > drained before switching to the new domain. Again, no it isn't required. Requests simply have to continue to be acked, it doesn't matter if they are acked against the wrong domain because the device will simply re-issue them.. Jason
On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: >> On 12/2/23 4:35 AM, Jason Gunthorpe wrote: >>> On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote: >>>> The iopf_queue_flush_dev() is called by the iommu driver before releasing >>>> a PASID. It ensures that all pending faults for this PASID have been >>>> handled or cancelled, and won't hit the address space that reuses this >>>> PASID. The driver must make sure that no new fault is added to the queue. >>> This needs more explanation, why should anyone care? >>> >>> More importantly, why is*discarding* the right thing to do? >>> Especially why would we discard a partial page request group? >>> >>> After we change a translation we may have PRI requests in a >>> queue. They need to be acknowledged, not discarded. The DMA in the >>> device should be restarted and the device should observe the new >>> translation - if it is blocking then it should take a DMA error. >>> >>> More broadly, we should just let things run their normal course. The >>> domain to deliver the fault to should be determined very early. If we >>> get a fault and there is no fault domain currently assigned then just >>> restart it. >>> >>> The main reason to fence would be to allow the domain to become freed >>> as the faults should be holding pointers to it. But I feel there are >>> simpler options for that then this.. >> >> In the iommu_detach_device_pasid() path, the domain is about to be >> removed from the pasid of device. The IOMMU driver performs the >> following steps sequentially: > > I know that is why it does, but it doesn't explain at all why. > >> 1. Clears the pasid translation entry. Thus, all subsequent DMA >> transactions (translation requests, translated requests or page >> requests) targeting the iommu domain will be blocked. >> >> 2. Waits until all pending page requests for the device's PASID have >> been reported to upper layers via the iommu_report_device_fault(). >> However, this does not guarantee that all page requests have been >> responded. >> >> 3. Free all partial page requests for this pasid since the page request >> response is only needed for a complete request group. There's no >> action required for the page requests which are not last of a request >> group. > > But we expect the last to come eventually since everything should be > grouped properly, so why bother doing this? > > Indeed if 2 worked, how is this even possible to have partials? Step 1 clears the pasid table entry, hence all subsequent page requests are blocked (hardware auto-respond the request but not put it in the queue). It is possible that a portion of a page fault group may have been queued for processing, but the last request is being blocked by hardware due to the pasid entry being in the blocking state. In reality, this may be a no-op as I haven't seen any real-world implementations of multiple-requests fault groups on Intel platforms. > >> 5. Follow the IOMMU hardware requirements (for example, VT-d sepc, >> section 7.10, Software Steps to Drain Page Requests & Responses) to >> drain in-flight page requests and page group responses between the >> remapping hardware queues and the endpoint device. >> >> With above steps done in iommu_detach_device_pasid(), the pasid could be >> re-used for any other address space. > > As I said, that isn't even required. There is no issue with leaking > PRI's across attachments. > > >>> I suppose the driver is expected to stop calling >>> iommu_report_device_fault() before calling this function, but that >>> doesn't seem like it is going to be possible. Drivers should be >>> implementing atomic replace for the PASID updates and in that case >>> there is no momement when it can say the HW will stop generating PRI. >> >> Atomic domain replacement for a PASID is not currently implemented in >> the core or driver. > > It is, the driver should implement set_dev_pasid in such a way that > repeated calls do replacements, ideally atomically. This is what ARM > SMMUv3 does after my changes. > >> Even if atomic replacement were to be implemented, >> it would be necessary to ensure that all translation requests, >> translated requests, page requests and responses for the old domain are >> drained before switching to the new domain. > > Again, no it isn't required. > > Requests simply have to continue to be acked, it doesn't matter if > they are acked against the wrong domain because the device will simply > re-issue them.. Ah! I start to get your point now. Even a page fault response is postponed to a new address space, which possibly be another address space or hardware blocking state, the hardware just retries. As long as we flushes all caches (IOTLB and device TLB) during switching, the mappings of the old domain won't leak. So it's safe to keep page requests there. Do I get you correctly? Best regards, baolu
On 12/2/23 4:35 AM, Jason Gunthorpe wrote: > I'm looking at this code after these patches are applied and it still > seems quite bonkers to me 🙁 > > Why do we allocate two copies of the memory on all fault paths? > > Why do we have fault->type still that only has one value? > > What is serializing iommu_get_domain_for_dev_pasid() in the fault > path? It looks sort of like the plan is to use iopf_param->lock and > ensure domain removal grabs that lock at least after the xarray is > changed - but does that actually happen? > > I would suggest, broadly, a flow for iommu_report_device_fault() sort > of: > > 1) Allocate memory for the evt. Every path except errors needs this, > so just do it > 2) iopf_get_dev_fault_param() should not have locks in it! This is > fast path now. Use a refcount, atomic compare exchange to allocate, > and RCU free. > 3) Everything runs under the fault_param->lock > 4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then > exit! This logic is really tortured and confusing > 5) Allocate memory and assemble the group > 6) Obtain the domain for this group and incr a per-domain counter that a > fault is pending on that domain > 7) Put the*group* into the WQ. Put the*group* on a list in fault_param > instead of the individual faults > 8) Don't linear search a linked list in iommu_page_response()! Pass > the group in that we got from the WQ that we*know* is still > active. Ack that passed group. > > When freeing a domain wait for the per-domain counter to go to > zero. This ensures that the WQ is flushed out and all the outside > domain references are gone. > > When wanting to turn off PRI make sure a non-PRI domain is > attached to everything. Fence against the HW's event queue. No new > iommu_report_device_fault() is possible. > > Lock the fault_param->lock and go through every pending group and > respond it. Mark the group memory as invalid so iommu_page_response() > NOP's it. Unlock, fence the HW against queued responses, and turn off > PRI. > > An*optimization* would be to lightly flush the domain when changing > the translation. Lock the fault_param->lock and look for groups in the > list with old_domain. Do the same as for PRI-off: respond to the > group, mark it as NOP. The WQ may still be chewing on something so the > domain free still has to check and wait. Very appreciated for all the ideas. I looked through the items and felt that all these are good optimizations. I am wondering whether we can take patch 1/12 ~ 10/12 of this series as a first step, a refactoring effort to support delivering iopf to userspace? I will follow up with one or multiple series to add the optimizations. Does this work for you? Or, you want to take any of above as the requirement for iommufd use case? Best regards, baolu
> From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Monday, December 4, 2023 9:33 AM > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > >> Even if atomic replacement were to be implemented, > >> it would be necessary to ensure that all translation requests, > >> translated requests, page requests and responses for the old domain are > >> drained before switching to the new domain. > > > > Again, no it isn't required. > > > > Requests simply have to continue to be acked, it doesn't matter if > > they are acked against the wrong domain because the device will simply > > re-issue them.. > > Ah! I start to get your point now. > > Even a page fault response is postponed to a new address space, which > possibly be another address space or hardware blocking state, the > hardware just retries. if blocking then the device shouldn't retry. btw if a stale request targets an virtual address which is outside of the valid VMA's of the new address space then visible side-effect will be incurred in handle_mm_fault() on the new space. Is it desired? Or if a pending response carries an error code (Invalid Request) from the old address space is received by the device when the new address space is already activated, the hardware will report an error even though there might be a valid mapping in the new space. > > As long as we flushes all caches (IOTLB and device TLB) during > switching, the mappings of the old domain won't leak. So it's safe to > keep page requests there. > I don't think atomic replace is the main usage for this draining requirement. Instead I'm more interested in the basic popular usage: attach-detach-attach and not convinced that no draining is required between iommu/device to avoid interference between activities from old/new address space.
On Mon, Dec 04, 2023 at 09:32:37AM +0800, Baolu Lu wrote: > > > > I know that is why it does, but it doesn't explain at all why. > > > > > 1. Clears the pasid translation entry. Thus, all subsequent DMA > > > transactions (translation requests, translated requests or page > > > requests) targeting the iommu domain will be blocked. > > > > > > 2. Waits until all pending page requests for the device's PASID have > > > been reported to upper layers via the iommu_report_device_fault(). > > > However, this does not guarantee that all page requests have been > > > responded. > > > > > > 3. Free all partial page requests for this pasid since the page request > > > response is only needed for a complete request group. There's no > > > action required for the page requests which are not last of a request > > > group. > > > > But we expect the last to come eventually since everything should be > > grouped properly, so why bother doing this? > > > > Indeed if 2 worked, how is this even possible to have partials? > > Step 1 clears the pasid table entry, hence all subsequent page requests > are blocked (hardware auto-respond the request but not put it in the > queue). OK, that part makes sense, but it should be clearly documented that is why this stuff is going on with the partial list. "We have to clear the parial list as the new domain may not generate a SW visible LAST. If it does generate a SW visible last then we simply incompletely fault it and restart the device which will fix things on retry" > > Requests simply have to continue to be acked, it doesn't matter if > > they are acked against the wrong domain because the device will simply > > re-issue them.. > > Ah! I start to get your point now. > > Even a page fault response is postponed to a new address space, which > possibly be another address space or hardware blocking state, the > hardware just retries. > > As long as we flushes all caches (IOTLB and device TLB) during switching, > the mappings of the old domain won't leak. So it's safe to keep page > requests there. > > Do I get you correctly? Yes It seems much simpler to me than trying to make this synchronous and it is compatible with hitless replace of a PASID. The lifetime and locking rules are also far more understandable Jason
On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote: > > From: Baolu Lu <baolu.lu@linux.intel.com> > > Sent: Monday, December 4, 2023 9:33 AM > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > > >> Even if atomic replacement were to be implemented, > > >> it would be necessary to ensure that all translation requests, > > >> translated requests, page requests and responses for the old domain are > > >> drained before switching to the new domain. > > > > > > Again, no it isn't required. > > > > > > Requests simply have to continue to be acked, it doesn't matter if > > > they are acked against the wrong domain because the device will simply > > > re-issue them.. > > > > Ah! I start to get your point now. > > > > Even a page fault response is postponed to a new address space, which > > possibly be another address space or hardware blocking state, the > > hardware just retries. > > if blocking then the device shouldn't retry. It does retry. The device is waiting on a PRI, it gets back an completion. It issues a new ATS (this is the rety) and the new-domain responds back with a failure indication. If the new domain had a present page it would respond with a translation If the new domain has a non-present page then we get a new PRI. The point is from a device perspective it is always doing something correct. > btw if a stale request targets an virtual address which is outside of the > valid VMA's of the new address space then visible side-effect will > be incurred in handle_mm_fault() on the new space. Is it desired? The whole thing is racy, if someone is radically changing the underlying mappings while DMA is ongoing then there is no way to synchronize 'before' and 'after' against a concurrent external device. So who cares? What we care about is that the ATC is coherent and never has stale data. The invalidation after changing the translation ensures this regardless of any outstanding un-acked PRI. > Or if a pending response carries an error code (Invalid Request) from > the old address space is received by the device when the new address > space is already activated, the hardware will report an error even > though there might be a valid mapping in the new space. Again, all racy. If a DMA is ongoing at the same instant things are changed there is no definitive way to say if it resolved before or after. The only thing we care about is that dmas that are completed before see the before translation and dmas that are started after see the after translation. DMAs that cross choose one at random. > I don't think atomic replace is the main usage for this draining > requirement. Instead I'm more interested in the basic popular usage: > attach-detach-attach and not convinced that no draining is required > between iommu/device to avoid interference between activities > from old/new address space. Something like IDXD needs to halt DMAs on the PASID and flush all outstanding DMA to get to a state where the PASID is quiet from the device perspective. This is the only way to stop interference. If the device is still issuing DMA after the domain changes then it is never going to work right. If *IDXD* needs some help to flush PRIs after it halts DMAs (because it can't do it on its own for some reason) then IDXD should have an explicit call to do that, after suspending new DMA. We don't know what things devices will need to do here, devices that are able to wait for PRIs to complete may want a cancelling flush to speed that up, and that shouldn't be part of the translation change. IOW the act of halting DMA and the act of changing the translation really should be different things. Then we get into interesting questions like what sequence is required for a successful FLR. :\ Jason
On Mon, Dec 04, 2023 at 11:46:30AM +0800, Baolu Lu wrote: > On 12/2/23 4:35 AM, Jason Gunthorpe wrote: > I am wondering whether we can take patch 1/12 ~ 10/12 of this series as > a first step, a refactoring effort to support delivering iopf to > userspace? I will follow up with one or multiple series to add the > optimizations. I think that is reasonable, though I would change the earlier patch to use RCU to obtain the fault data. Jason
On 12/4/23 9:27 PM, Jason Gunthorpe wrote: > On Mon, Dec 04, 2023 at 11:46:30AM +0800, Baolu Lu wrote: >> On 12/2/23 4:35 AM, Jason Gunthorpe wrote: >> I am wondering whether we can take patch 1/12 ~ 10/12 of this series as >> a first step, a refactoring effort to support delivering iopf to >> userspace? I will follow up with one or multiple series to add the >> optimizations. > I think that is reasonable, though I would change the earlier patch to > use RCU to obtain the fault data. All right! I will do this in the updated version. Best regards, baolu
> From: Jason Gunthorpe <jgg@ziepe.ca> > Sent: Monday, December 4, 2023 9:25 PM > > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote: > > > From: Baolu Lu <baolu.lu@linux.intel.com> > > > Sent: Monday, December 4, 2023 9:33 AM > > > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > > > >> Even if atomic replacement were to be implemented, > > > >> it would be necessary to ensure that all translation requests, > > > >> translated requests, page requests and responses for the old domain > are > > > >> drained before switching to the new domain. > > > > > > > > Again, no it isn't required. > > > > > > > > Requests simply have to continue to be acked, it doesn't matter if > > > > they are acked against the wrong domain because the device will simply > > > > re-issue them.. > > > > > > Ah! I start to get your point now. > > > > > > Even a page fault response is postponed to a new address space, which > > > possibly be another address space or hardware blocking state, the > > > hardware just retries. > > > > if blocking then the device shouldn't retry. > > It does retry. > > The device is waiting on a PRI, it gets back an completion. It issues > a new ATS (this is the rety) and the new-domain responds back with a > failure indication. I'm not sure that is the standard behavior defined by PCIe spec. According to "10.4.2 Page Request Group Response Message", function's response to Page Request failure is implementation specific. so a new ATS is optional and likely the device will instead abort the DMA if PRI response already indicates a failure. > > If the new domain had a present page it would respond with a > translation > > If the new domain has a non-present page then we get a new PRI. > > The point is from a device perspective it is always doing something > correct. > > > btw if a stale request targets an virtual address which is outside of the > > valid VMA's of the new address space then visible side-effect will > > be incurred in handle_mm_fault() on the new space. Is it desired? > > The whole thing is racy, if someone is radically changing the > underlying mappings while DMA is ongoing then there is no way to > synchronize 'before' and 'after' against a concurrent external device. > > So who cares? > > What we care about is that the ATC is coherent and never has stale > data. The invalidation after changing the translation ensures this > regardless of any outstanding un-acked PRI. > > > Or if a pending response carries an error code (Invalid Request) from > > the old address space is received by the device when the new address > > space is already activated, the hardware will report an error even > > though there might be a valid mapping in the new space. > > Again, all racy. If a DMA is ongoing at the same instant things are > changed there is no definitive way to say if it resolved before or > after. > > The only thing we care about is that dmas that are completed before > see the before translation and dmas that are started after see the > after translation. > > DMAs that cross choose one at random. Yes that makes sense for replacement. But here we are talking about a draining requirement when disabling a pasid entry, which is certainly not involved in replacement. > > > I don't think atomic replace is the main usage for this draining > > requirement. Instead I'm more interested in the basic popular usage: > > attach-detach-attach and not convinced that no draining is required > > between iommu/device to avoid interference between activities > > from old/new address space. > > Something like IDXD needs to halt DMAs on the PASID and flush all > outstanding DMA to get to a state where the PASID is quiet from the > device perspective. This is the only way to stop interference. why is it IDXD specific behavior? I suppose all devices need to quiesce the outstanding DMAs when tearing down the binding between the PASID and previous address space. and here what you described is the normal behavior. In this case I agree that no draining is required in iommu side given the device should have quiesced all outstanding DMAs including page requests. but there are also terminal conditions e.g. when a workqueue is reset after hang hence additional draining is required from the iommu side to ensure all the outstanding page requests/responses are properly handled. vt-d spec defines a draining process to cope with those terminal conditions (see 7.9 Pending Page Request Handling on Terminal Conditions). intel-iommu driver just implements it by default for simplicity (one may consider providing explicit API for drivers to call but not sure of the necessity if such terminal conditions apply to most devices). anyway this is not a fast path. another example might be stop marker. A device using stop marker doesn't need to wait for outstanding page requests. According to PCIe spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device simply marks outstanding page request as stale and sends a stop marker message to the IOMMU. Page responses for those stale requests are ignored. But presumably the iommu driver still needs to drain those requests until the stop marker message in unbind to avoid them incorrectly routed to a new address space in case the PASID is rebound to another process immediately. > > If the device is still issuing DMA after the domain changes then it is > never going to work right. > > If *IDXD* needs some help to flush PRIs after it halts DMAs (because > it can't do it on its own for some reason) then IDXD should have an > explicit call to do that, after suspending new DMA. as above I don't think IDXD itself has any special requirements. We are discussing general device terminal conditions which are considered by the iommu spec. > > We don't know what things devices will need to do here, devices that > are able to wait for PRIs to complete may want a cancelling flush to > speed that up, and that shouldn't be part of the translation change. > > IOW the act of halting DMA and the act of changing the translation > really should be different things. Then we get into interesting > questions like what sequence is required for a successful FLR. :\ > > Jason
On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@ziepe.ca> > > Sent: Monday, December 4, 2023 9:25 PM > > > > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote: > > > > From: Baolu Lu <baolu.lu@linux.intel.com> > > > > Sent: Monday, December 4, 2023 9:33 AM > > > > > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > > > > >> Even if atomic replacement were to be implemented, > > > > >> it would be necessary to ensure that all translation requests, > > > > >> translated requests, page requests and responses for the old domain > > are > > > > >> drained before switching to the new domain. > > > > > > > > > > Again, no it isn't required. > > > > > > > > > > Requests simply have to continue to be acked, it doesn't matter if > > > > > they are acked against the wrong domain because the device will simply > > > > > re-issue them.. > > > > > > > > Ah! I start to get your point now. > > > > > > > > Even a page fault response is postponed to a new address space, which > > > > possibly be another address space or hardware blocking state, the > > > > hardware just retries. > > > > > > if blocking then the device shouldn't retry. > > > > It does retry. > > > > The device is waiting on a PRI, it gets back an completion. It issues > > a new ATS (this is the rety) and the new-domain responds back with a > > failure indication. > > I'm not sure that is the standard behavior defined by PCIe spec. > According to "10.4.2 Page Request Group Response Message", function's > response to Page Request failure is implementation specific. > > so a new ATS is optional and likely the device will instead abort the DMA > if PRI response already indicates a failure. I didn't said the PRI would fail, I said the ATS would fail with a non-present. It has to work this way or it is completely broken with respect to existing races in the mm side. Agents must retry non-present ATS answers until you get a present or a ATS failure. > > Again, all racy. If a DMA is ongoing at the same instant things are > > changed there is no definitive way to say if it resolved before or > > after. > > > > The only thing we care about is that dmas that are completed before > > see the before translation and dmas that are started after see the > > after translation. > > > > DMAs that cross choose one at random. > > Yes that makes sense for replacement. > > But here we are talking about a draining requirement when disabling > a pasid entry, which is certainly not involved in replacement. It is the same argument, you are replacing a PTE that was non-present with one that is failing/blocking - the result of a DMA that crosses this event can be either. > > > I don't think atomic replace is the main usage for this draining > > > requirement. Instead I'm more interested in the basic popular usage: > > > attach-detach-attach and not convinced that no draining is required > > > between iommu/device to avoid interference between activities > > > from old/new address space. > > > > Something like IDXD needs to halt DMAs on the PASID and flush all > > outstanding DMA to get to a state where the PASID is quiet from the > > device perspective. This is the only way to stop interference. > > why is it IDXD specific behavior? I suppose all devices need to quiesce > the outstanding DMAs when tearing down the binding between the > PASID and previous address space. Because it is so simple HW I assume this is why this code is being pushed here :) > but there are also terminal conditions e.g. when a workqueue is > reset after hang hence additional draining is required from the > iommu side to ensure all the outstanding page requests/responses > are properly handled. Then it should be coded as an explicit drain request from device when and where they need it. It should not be integrated into the iommu side because it is nonsensical. Devices expecting consistent behavior must stop DMA before changing translation, and if they need help to do it they must call APIs. Changing translation is not required after a so called "terminal event". > vt-d spec defines a draining process to cope with those terminal > conditions (see 7.9 Pending Page Request Handling on Terminal > Conditions). intel-iommu driver just implements it by default for > simplicity (one may consider providing explicit API for drivers to > call but not sure of the necessity if such terminal conditions > apply to most devices). anyway this is not a fast path. It is not "by default" it is in the wrong place. These terminal conditions are things like FLR. FLR has nothing to do with changing the translation. I can trigger FLR and keep the current translation and still would want to flush out all the PRIs before starting DMA again to avoid protocol confusion. An API is absolutely necessary. Confusing the cases that need draining with translation change is just not logically right. eg we do need to modify VFIO to do the drain on FLR like the spec explains! Draining has to be ordered correctly with whatever the device is doing. Drain needs to come after FLR, for instance. It needs to come after a work queue reset, because drain doesn't make any sense unless it is coupled with a DMA stop at the device. Hacking a DMA stop by forcing a blocking translation is not logically correct, with wrong ordering the device may see unexpected translation failures which may trigger AERs or bad things.. > another example might be stop marker. A device using stop marker > doesn't need to wait for outstanding page requests. According to PCIe > spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device > simply marks outstanding page request as stale and sends a stop > marker message to the IOMMU. Page responses for those stale > requests are ignored. But presumably the iommu driver still needs > to drain those requests until the stop marker message in unbind > to avoid them incorrectly routed to a new address space in case the > PASID is rebound to another process immediately. Stop marker doesn't change anything, in all processing it just removes requests that have yet to complete. If a device is using stop then most likely the whole thing is racy and the OS simply has to be ready to handle stop at any time. Jason
> From: Jason Gunthorpe <jgg@ziepe.ca> > Sent: Tuesday, December 5, 2023 9:53 AM > > On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@ziepe.ca> > > > Sent: Monday, December 4, 2023 9:25 PM > > > > > > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote: > > > > > From: Baolu Lu <baolu.lu@linux.intel.com> > > > > > Sent: Monday, December 4, 2023 9:33 AM > > > > > > > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote: > > > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote: > > > > > >> Even if atomic replacement were to be implemented, > > > > > >> it would be necessary to ensure that all translation requests, > > > > > >> translated requests, page requests and responses for the old > domain > > > are > > > > > >> drained before switching to the new domain. > > > > > > > > > > > > Again, no it isn't required. > > > > > > > > > > > > Requests simply have to continue to be acked, it doesn't matter if > > > > > > they are acked against the wrong domain because the device will > simply > > > > > > re-issue them.. > > > > > > > > > > Ah! I start to get your point now. > > > > > > > > > > Even a page fault response is postponed to a new address space, > which > > > > > possibly be another address space or hardware blocking state, the > > > > > hardware just retries. > > > > > > > > if blocking then the device shouldn't retry. > > > > > > It does retry. > > > > > > The device is waiting on a PRI, it gets back an completion. It issues > > > a new ATS (this is the rety) and the new-domain responds back with a > > > failure indication. > > > > I'm not sure that is the standard behavior defined by PCIe spec. > > > According to "10.4.2 Page Request Group Response Message", function's > > response to Page Request failure is implementation specific. > > > > so a new ATS is optional and likely the device will instead abort the DMA > > if PRI response already indicates a failure. > > I didn't said the PRI would fail, I said the ATS would fail with a > non-present. > > It has to work this way or it is completely broken with respect to > existing races in the mm side. Agents must retry non-present ATS > answers until you get a present or a ATS failure. My understanding of the sequence is like below: <'D' for device, 'I' for IOMMU> (D) send a ATS translation request (I) respond translation result (D) If success then sends DMA to the target page otherwise send a PRI request (I) raise an IOMMU interrupt allowing sw to fix the translation (I) generate a PRI response to device (D) if success then jump to the first step to retry otherwise abort the current request If mm changes the mapping after a success PRI response, mmu notifier callback in iommu driver needs to wait for device tlb invalidation completion which the device will order properly with outstanding DMA requests using the old translation. If you refer to the 'retry' after receiving a success PRI response, then yes. but there is really no reason to retry upon a PRI response failure which indicates that the faulting address is not a valid one which OS would like to fix. > > > > Again, all racy. If a DMA is ongoing at the same instant things are > > > changed there is no definitive way to say if it resolved before or > > > after. > > > > > > The only thing we care about is that dmas that are completed before > > > see the before translation and dmas that are started after see the > > > after translation. > > > > > > DMAs that cross choose one at random. > > > > Yes that makes sense for replacement. > > > > But here we are talking about a draining requirement when disabling > > a pasid entry, which is certainly not involved in replacement. > > It is the same argument, you are replacing a PTE that was non-present s/non-present/present/ > with one that is failing/blocking - the result of a DMA that crosses > this event can be either. kind of > > > > > I don't think atomic replace is the main usage for this draining > > > > requirement. Instead I'm more interested in the basic popular usage: > > > > attach-detach-attach and not convinced that no draining is required > > > > between iommu/device to avoid interference between activities > > > > from old/new address space. > > > > > > Something like IDXD needs to halt DMAs on the PASID and flush all > > > outstanding DMA to get to a state where the PASID is quiet from the > > > device perspective. This is the only way to stop interference. > > > > why is it IDXD specific behavior? I suppose all devices need to quiesce > > the outstanding DMAs when tearing down the binding between the > > PASID and previous address space. > > Because it is so simple HW I assume this is why this code is being > pushed here :) > > > but there are also terminal conditions e.g. when a workqueue is > > reset after hang hence additional draining is required from the > > iommu side to ensure all the outstanding page requests/responses > > are properly handled. > > Then it should be coded as an explicit drain request from device when > and where they need it. > > It should not be integrated into the iommu side because it is > nonsensical. Devices expecting consistent behavior must stop DMA > before changing translation, and if they need help to do it they must > call APIs. Changing translation is not required after a so called > "terminal event". > > > vt-d spec defines a draining process to cope with those terminal > > conditions (see 7.9 Pending Page Request Handling on Terminal > > Conditions). intel-iommu driver just implements it by default for > > simplicity (one may consider providing explicit API for drivers to > > call but not sure of the necessity if such terminal conditions > > apply to most devices). anyway this is not a fast path. > > It is not "by default" it is in the wrong place. These terminal > conditions are things like FLR. FLR has nothing to do with changing > the translation. I can trigger FLR and keep the current translation > and still would want to flush out all the PRIs before starting DMA > again to avoid protocol confusion. > > An API is absolutely necessary. Confusing the cases that need draining > with translation change is just not logically right. > > eg we do need to modify VFIO to do the drain on FLR like the spec > explains! > > Draining has to be ordered correctly with whatever the device is > doing. Drain needs to come after FLR, for instance. It needs to come > after a work queue reset, because drain doesn't make any sense unless > it is coupled with a DMA stop at the device. Okay that makes sense. As Baolu and you already agreed let's separate this fix out of this series. The minor interesting aspect is how to document this requirement clearly so drivers won't skip calling it when sva is enabled. > > Hacking a DMA stop by forcing a blocking translation is not logically > correct, with wrong ordering the device may see unexpected translation > failures which may trigger AERs or bad things.. where is such hack? though the current implementation of draining is not clean, it's put inside pasid-disable-sequence instead of forcing a blocking translation implicitly in iommu driver i.e. it's still the driver making decision for what translation to be used... > > > another example might be stop marker. A device using stop marker > > doesn't need to wait for outstanding page requests. According to PCIe > > spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device > > simply marks outstanding page request as stale and sends a stop > > marker message to the IOMMU. Page responses for those stale > > requests are ignored. But presumably the iommu driver still needs > > to drain those requests until the stop marker message in unbind > > to avoid them incorrectly routed to a new address space in case the > > PASID is rebound to another process immediately. > > Stop marker doesn't change anything, in all processing it just removes > requests that have yet to complete. If a device is using stop then > most likely the whole thing is racy and the OS simply has to be ready > to handle stop at any time. > I'm not sure whether a device supporting stop marker may provide certain abort-dma cmd to not wait. But guess such abort semantics will be likely used in terminal condition too so the same argument still hold. Presumably normal translation change should always use a stop-wait semantics for outstanding DMAs. 😊
On Tue, Dec 05, 2023 at 03:23:05AM +0000, Tian, Kevin wrote: > > I didn't said the PRI would fail, I said the ATS would fail with a > > non-present. > > > > It has to work this way or it is completely broken with respect to > > existing races in the mm side. Agents must retry non-present ATS > > answers until you get a present or a ATS failure. > > My understanding of the sequence is like below: > > <'D' for device, 'I' for IOMMU> > > (D) send a ATS translation request > (I) respond translation result > (D) If success then sends DMA to the target page > otherwise send a PRI request > (I) raise an IOMMU interrupt allowing sw to fix the translation > (I) generate a PRI response to device > (D) if success then jump to the first step to retry > otherwise abort the current request > > If mm changes the mapping after a success PRI response, mmu notifier > callback in iommu driver needs to wait for device tlb invalidation completion > which the device will order properly with outstanding DMA requests using > the old translation. > > If you refer to the 'retry' after receiving a success PRI response, then yes. > > but there is really no reason to retry upon a PRI response failure which > indicates that the faulting address is not a valid one which OS would like > to fix. Right > > Draining has to be ordered correctly with whatever the device is > > doing. Drain needs to come after FLR, for instance. It needs to come > > after a work queue reset, because drain doesn't make any sense unless > > it is coupled with a DMA stop at the device. > > Okay that makes sense. As Baolu and you already agreed let's separate > this fix out of this series. > > The minor interesting aspect is how to document this requirement > clearly so drivers won't skip calling it when sva is enabled. All changes to translation inside kernel drivers should only be done once the DMA is halted, otherwise things possibily become security troubled. We should document this clearly, it is already expected in common cases like using the DMA API and when removing() drivers. It also applies when the driver is manually changing a PASID. The issue is not drain, it is that the HW is still doing DMA on the PASID and the PASID may be assigned to a new process. This kernel *must* prevent this directly and strongly. If the device requires a drain to halt its DMA, then that is a device specific sequence. Otherwise it should simply halt its DMA in whatever device specific way it has. > > Hacking a DMA stop by forcing a blocking translation is not logically > > correct, with wrong ordering the device may see unexpected translation > > failures which may trigger AERs or bad things.. > > where is such hack? though the current implementation of draining > is not clean, it's put inside pasid-disable-sequence instead of forcing > a blocking translation implicitly in iommu driver i.e. it's still the driver > making decision for what translation to be used... It is mis-understanding the purpose of drain. In normal operating cases PRI just flows and the device will eventually, naturally, reach a stable terminal case. We don't provide any ordering guarentees across translation changes so PRI just follows that design. If you change the translation with ongoing DMA then you just don't know what order things will happen in. The main purpose of drain is to keep the PRI protocol itself in sync against events on the device side that cause it to forget about the tags it has already issued. Eg a FLR should reset the tag record. If a device then issues a new PRI with a tag that matches a tag that was outstanding prior to FLR we can get a corruption. So any drain sequence should start with the device halting new PRIs. We flush all PRI tags from the system completely, and then the device may resume issuing new PRIs. When the device resets it's tag labels is a property of the device. Notice none of this has anything to do with change of translation. Change of translation, or flush of ATC, does not invalidate the tags. A secondary case is to help devices halt their DMA when they cannot do this on their own. Jason
diff --git a/include/linux/iommu.h b/include/linux/iommu.h index c17d5979d70d..cd3cdeb69f49 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -1431,7 +1431,7 @@ iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm) #ifdef CONFIG_IOMMU_IOPF int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev); int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev); -int iopf_queue_flush_dev(struct device *dev); +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid); struct iopf_queue *iopf_queue_alloc(const char *name); void iopf_queue_free(struct iopf_queue *queue); int iopf_queue_discard_partial(struct iopf_queue *queue); @@ -1453,7 +1453,7 @@ iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev) return -ENODEV; } -static inline int iopf_queue_flush_dev(struct device *dev) +static inline int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid) { return -ENODEV; } diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 780c5bd73ec2..659de9c16024 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -495,7 +495,7 @@ void intel_drain_pasid_prq(struct device *dev, u32 pasid) goto prq_retry; } - iopf_queue_flush_dev(dev); + iopf_queue_discard_dev_pasid(dev, pasid); /* * Perform steps described in VT-d spec CH7.10 to drain page diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index b80574323cbc..b288c73f2b22 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -260,10 +260,9 @@ int iommu_page_response(struct device *dev, /* Only send response if there is a fault report pending */ mutex_lock(&fault_param->lock); - if (list_empty(&fault_param->faults)) { - dev_warn_ratelimited(dev, "no pending PRQ, drop response\n"); + if (list_empty(&fault_param->faults)) goto done_unlock; - } + /* * Check if we have a matching page request pending to respond, * otherwise return -EINVAL @@ -304,30 +303,67 @@ int iommu_page_response(struct device *dev, EXPORT_SYMBOL_GPL(iommu_page_response); /** - * iopf_queue_flush_dev - Ensure that all queued faults have been processed - * @dev: the endpoint whose faults need to be flushed. + * iopf_queue_discard_dev_pasid - Discard all pending faults for a PASID + * @dev: the endpoint whose faults need to be discarded. + * @pasid: the PASID of the endpoint. * * The IOMMU driver calls this before releasing a PASID, to ensure that all - * pending faults for this PASID have been handled, and won't hit the address - * space of the next process that uses this PASID. The driver must make sure - * that no new fault is added to the queue. In particular it must flush its - * low-level queue before calling this function. + * pending faults for this PASID have been handled or dropped, and won't hit + * the address space of the next process that uses this PASID. The driver + * must make sure that no new fault is added to the queue. In particular it + * must flush its low-level queue before calling this function. * * Return: 0 on success and <0 on error. */ -int iopf_queue_flush_dev(struct device *dev) +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid) { struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev); + const struct iommu_ops *ops = dev_iommu_ops(dev); + struct iommu_page_response resp; + struct iopf_fault *iopf, *next; + int ret = 0; if (!iopf_param) return -ENODEV; flush_workqueue(iopf_param->queue->wq); + + mutex_lock(&iopf_param->lock); + list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) { + if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) || + iopf->fault.prm.pasid != pasid) + break; + + list_del(&iopf->list); + kfree(iopf); + } + + list_for_each_entry_safe(iopf, next, &iopf_param->faults, list) { + if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) || + iopf->fault.prm.pasid != pasid) + continue; + + memset(&resp, 0, sizeof(struct iommu_page_response)); + resp.pasid = iopf->fault.prm.pasid; + resp.grpid = iopf->fault.prm.grpid; + resp.code = IOMMU_PAGE_RESP_INVALID; + + if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID) + resp.flags = IOMMU_PAGE_RESP_PASID_VALID; + + ret = ops->page_response(dev, iopf, &resp); + if (ret) + break; + + list_del(&iopf->list); + kfree(iopf); + } + mutex_unlock(&iopf_param->lock); iopf_put_dev_fault_param(iopf_param); - return 0; + return ret; } -EXPORT_SYMBOL_GPL(iopf_queue_flush_dev); +EXPORT_SYMBOL_GPL(iopf_queue_discard_dev_pasid); /** * iopf_group_response - Respond a group of page faults