Message ID | 20221116171656.4128212-8-schnelle@linux.ibm.com |
---|---|
State | New |
Headers |
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Matthew Rosato <mjrosato@linux.ibm.com>, Gerd Bayer <gbayer@linux.ibm.com>, iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, Jason Gunthorpe <jgg@nvidia.com>, Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>, linux-s390@vger.kernel.org, borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com, gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com, svens@linux.ibm.com, linux-kernel@vger.kernel.org, Julian Ruess <julianr@linux.ibm.com>
Subject: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
Date: Wed, 16 Nov 2022 18:16:56 +0100
Message-Id: <20221116171656.4128212-8-schnelle@linux.ibm.com>
In-Reply-To: <20221116171656.4128212-1-schnelle@linux.ibm.com>
References: <20221116171656.4128212-1-schnelle@linux.ibm.com> |
Series | iommu/dma: s390 DMA API conversion and optimized IOTLB flushing |
Commit Message
Niklas Schnelle
Nov. 16, 2022, 5:16 p.m. UTC
When RPCIT indicates that the underlying hypervisor has run out of
resources it often means that its IOVA space is exhausted and IOVAs need
to be freed before new ones can be created. By triggering a flush of the
IOVA queue we can get the queued IOVAs freed and also get the new
mapping established during the global flush.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
drivers/iommu/dma-iommu.c | 14 +++++++++-----
drivers/iommu/dma-iommu.h | 1 +
drivers/iommu/s390-iommu.c | 7 +++++--
3 files changed, 15 insertions(+), 7 deletions(-)
Comments
On 2022-11-16 17:16, Niklas Schnelle wrote: > When RPCIT indicates that the underlying hypervisor has run out of > resources it often means that its IOVA space is exhausted and IOVAs need > to be freed before new ones can be created. By triggering a flush of the > IOVA queue we can get the queued IOVAs freed and also get the new > mapping established during the global flush. Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is exhausted and fail the DMA API call before even getting as far as iommu_map(), though? Or is there some less obvious limitation like a maximum total number of distinct IOVA regions regardless of size? Other than the firmware reserved region helpers which are necessarily a bit pick-and-mix, I've been trying to remove all the iommu-dma details from drivers, so I'd really like to maintain that separation if at all possible. > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> > --- > drivers/iommu/dma-iommu.c | 14 +++++++++----- > drivers/iommu/dma-iommu.h | 1 + > drivers/iommu/s390-iommu.c | 7 +++++-- > 3 files changed, 15 insertions(+), 7 deletions(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 3801cdf11aa8..54e7f63fd0d9 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie) > spin_unlock_irqrestore(&fq->lock, flags); > } > > -static void fq_flush_timeout(struct timer_list *t) > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie) > { > - struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); > - > - atomic_set(&cookie->fq_timer_on, 0); > fq_flush_iotlb(cookie); > - > if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ) > fq_flush_percpu(cookie); > else > fq_flush_single(cookie); > } > > +static void fq_flush_timeout(struct timer_list *t) > +{ > + struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); > + > + atomic_set(&cookie->fq_timer_on, 0); > + iommu_dma_flush_fq(cookie); > +} > + > static void queue_iova(struct iommu_dma_cookie *cookie, > unsigned long pfn, unsigned long pages, > struct list_head *freelist) > diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h > index 942790009292..cac06030aa26 100644 > --- a/drivers/iommu/dma-iommu.h > +++ b/drivers/iommu/dma-iommu.h > @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain); > void iommu_put_dma_cookie(struct iommu_domain *domain); > > int iommu_dma_init_fq(struct iommu_domain *domain); > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie); > > void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); > > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c > index 087bb2acff30..9c2782c4043e 100644 > --- a/drivers/iommu/s390-iommu.c > +++ b/drivers/iommu/s390-iommu.c > @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain, > { > struct s390_domain *s390_domain = to_s390_domain(domain); > struct zpci_dev *zdev; > + int rc; > > rcu_read_lock(); > list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) { > if (!zdev->tlb_refresh) > continue; > atomic64_inc(&s390_domain->ctrs.sync_map_rpcits); > - zpci_refresh_trans((u64)zdev->fh << 32, > - iova, size); > + rc = zpci_refresh_trans((u64)zdev->fh << 32, > + iova, size); > + if (rc == -ENOMEM) > + iommu_dma_flush_fq(domain->iova_cookie); Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or IOMMU_DOMAIN_UNMANAGED domain? 
However I can't figure out how this is supposed to work anyway - .sync_map only gets called if .map claimed that the actual mapping(s) succeeded, it can't fail itself, and even if it does free up some IOVAs at this point by draining the flush queue, I don't see how the mapping then gets retried, or what happens if it still fails after that :/ Thanks, Robin. > } > rcu_read_unlock(); > }
On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: > On 2022-11-16 17:16, Niklas Schnelle wrote: > > When RPCIT indicates that the underlying hypervisor has run out of > > resources it often means that its IOVA space is exhausted and IOVAs need > > to be freed before new ones can be created. By triggering a flush of the > > IOVA queue we can get the queued IOVAs freed and also get the new > > mapping established during the global flush. > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is > exhausted and fail the DMA API call before even getting as far as > iommu_map(), though? Or is there some less obvious limitation like a > maximum total number of distinct IOVA regions regardless of size? Well, yes and no. Your thinking is of course correct if the advertised available IOVA space can be fully utilized without exhausting hypervisor resources we won't trigger this case. However sadly there are complications. The most obvious being that in QEMU/KVM the restriction of the IOVA space to what QEMU can actually have mapped at once was just added recently[0] prior to that we would regularly go through this "I'm out of resources free me some IOVAs" dance with our existing DMA API implementation where this just triggers an early cycle of freeing all unused IOVAs followed by a global flush. On z/VM I know of no situations where this is triggered. That said this signalling is architected so z/VM may have corner cases where it does this. On our bare metal hypervisor (no paging) this return code is unused and IOTLB flushes are simply hardware cache flushes as on bare metal platforms. [0] https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ > > Other than the firmware reserved region helpers which are necessarily a > bit pick-and-mix, I've been trying to remove all the iommu-dma details > from drivers, so I'd really like to maintain that separation if at all > possible. Hmm, tough one. Having a flush queue implies that we're holding on to IOVAs that we could free and this is kind of directly architected into our IOTLB flush with this "free some IOVAs and try again" error return. 
> > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> > > --- > > drivers/iommu/dma-iommu.c | 14 +++++++++----- > > drivers/iommu/dma-iommu.h | 1 + > > drivers/iommu/s390-iommu.c | 7 +++++-- > > 3 files changed, 15 insertions(+), 7 deletions(-) > > > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > > index 3801cdf11aa8..54e7f63fd0d9 100644 > > --- a/drivers/iommu/dma-iommu.c > > +++ b/drivers/iommu/dma-iommu.c > > @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie) > > spin_unlock_irqrestore(&fq->lock, flags); > > } > > > > -static void fq_flush_timeout(struct timer_list *t) > > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie) > > { > > - struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); > > - > > - atomic_set(&cookie->fq_timer_on, 0); > > fq_flush_iotlb(cookie); > > - > > if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ) > > fq_flush_percpu(cookie); > > else > > fq_flush_single(cookie); > > } > > > > +static void fq_flush_timeout(struct timer_list *t) > > +{ > > + struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); > > + > > + atomic_set(&cookie->fq_timer_on, 0); > > + iommu_dma_flush_fq(cookie); > > +} > > + > > static void queue_iova(struct iommu_dma_cookie *cookie, > > unsigned long pfn, unsigned long pages, > > struct list_head *freelist) > > diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h > > index 942790009292..cac06030aa26 100644 > > --- a/drivers/iommu/dma-iommu.h > > +++ b/drivers/iommu/dma-iommu.h > > @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain); > > void iommu_put_dma_cookie(struct iommu_domain *domain); > > > > int iommu_dma_init_fq(struct iommu_domain *domain); > > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie); > > > > void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); > > > > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c > > index 087bb2acff30..9c2782c4043e 100644 > > --- a/drivers/iommu/s390-iommu.c > > +++ b/drivers/iommu/s390-iommu.c > > @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain, > > { > > struct s390_domain *s390_domain = to_s390_domain(domain); > > struct zpci_dev *zdev; > > + int rc; > > > > rcu_read_lock(); > > list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) { > > if (!zdev->tlb_refresh) > > continue; > > atomic64_inc(&s390_domain->ctrs.sync_map_rpcits); > > - zpci_refresh_trans((u64)zdev->fh << 32, > > - iova, size); > > + rc = zpci_refresh_trans((u64)zdev->fh << 32, > > + iova, size); > > + if (rc == -ENOMEM) > > + iommu_dma_flush_fq(domain->iova_cookie); > > Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or > IOMMU_DOMAIN_UNMANAGED domain? In theory yes and then iommu_dma_flush_fq() still does the .flush_iotlb_all to give the hypervisor a chance to look for freed IOVAs but without flush queues you're really just running out of IOVA space and that's futile. This does highlight an important missed issue though. As we don't return the resulting error from the subsequent .flush_iotlb_all we only find out if this didn't work once the mapping is used while in our current DMA API implementation we correctly return DMA_MAPPING_ERROR in this case. I guess this means we do need error returns from the IOTLB helpers since in a paged guest this is where we finally find out that our mapping couldn't be synced to the hypervisor's shadow table and I don't really see a way around that. 
Also there are other error conditions implied in this shadowing too, for example the hypervisor could simply fail to pin guest memory, and while we can't do anything about that, at least we should fail the mapping operation. > > However I can't figure out how this is supposed to work anyway - > .sync_map only gets called if .map claimed that the actual mapping(s) > succeeded, it can't fail itself, and even if it does free up some IOVAs > at this point by draining the flush queue, I don't see how the mapping > then gets retried, or what happens if it still fails after that :/ > > Thanks, > Robin. Yeah, this is a bit non-obvious and you are correct in that the architecture requires a subsequent IOTLB flush, i.e. a retry, for the range that returned the error. And your last point is then exactly the issue above that we miss if the retry still failed. As for the good path, in the mapping operation but before the .sync_map we have updated the IOMMU translation table so the translation is recorded but not synced to the hypervisor's shadow table. Now when the .sync_map is called and the IOTLB flush returns -ENOMEM then iommu_dma_flush_fq() will call .flush_iotlb_all which causes the hypervisor to look at the entire guest translation table and shadow all translations that were not yet shadowed. I.e. the .flush_iotlb_all "retries" the failed .sync_map. > > > } > > rcu_read_unlock(); > > }
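A rough sketch of the good path described above, assuming a DMA-FQ domain; the DMA API entry point and exact call chain are simplified for illustration:

/* Guest-side flow when the hypervisor reports RPCIT out of resources: */
dma_map_page()
  -> iommu_map()                          /* guest IOMMU translation table updated */
  -> s390_iommu_iotlb_sync_map()          /* ranged RPCIT for the new mapping */
       zpci_refresh_trans() == -ENOMEM    /* hypervisor could not shadow the entry */
       -> iommu_dma_flush_fq()
            -> fq_flush_iotlb()           /* .flush_iotlb_all: global RPCIT, the
                                             hypervisor re-walks the guest table and
                                             shadows the not-yet-shadowed mapping */
            -> fq_flush_percpu() or fq_flush_single()  /* queued IOVAs finally freed */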
On 2022-11-29 12:00, Niklas Schnelle wrote: > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: >> On 2022-11-16 17:16, Niklas Schnelle wrote: >>> When RPCIT indicates that the underlying hypervisor has run out of >>> resources it often means that its IOVA space is exhausted and IOVAs need >>> to be freed before new ones can be created. By triggering a flush of the >>> IOVA queue we can get the queued IOVAs freed and also get the new >>> mapping established during the global flush. >> >> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is >> exhausted and fail the DMA API call before even getting as far as >> iommu_map(), though? Or is there some less obvious limitation like a >> maximum total number of distinct IOVA regions regardless of size? > > Well, yes and no. Your thinking is of course correct if the advertised > available IOVA space can be fully utilized without exhausting > hypervisor resources we won't trigger this case. However sadly there > are complications. The most obvious being that in QEMU/KVM the > restriction of the IOVA space to what QEMU can actually have mapped at > once was just added recently[0] prior to that we would regularly go > through this "I'm out of resources free me some IOVAs" dance with our > existing DMA API implementation where this just triggers an early cycle > of freeing all unused IOVAs followed by a global flush. On z/VM I know > of no situations where this is triggered. That said this signalling is > architected so z/VM may have corner cases where it does this. On our > bare metal hypervisor (no paging) this return code is unused and IOTLB > flushes are simply hardware cache flushes as on bare metal platforms. > > [0] > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ That sheds a bit more light, thanks, although I'm still not confident I fully understand the whole setup. AFAICS that patch looks to me like it's putting a fixed limit on the size of the usable address space. That in turn implies that "free some IOVAs and try again" might be a red herring and never going to work; for your current implementation, what that presumably means in reality is "free some IOVAs, resetting the allocator to start allocating lower down in the address space where it will happen to be below that limit, and try again", but the iommu-dma allocator won't do that. If it doesn't know that some arbitrary range below the top of the driver-advertised aperture is unusable, it will just keep allocating IOVAs up there and mappings will always fail. If the driver can't accurately represent the usable IOVA space via the aperture and/or reserved regions, then this whole approach seems doomed. If on the other hand I've misunderstood and you can actually still use any address, just not all of them at the same time, then it might in fact be considerably easier to skip the flush queue mechanism entirely and implement this internally to the driver - basically make .iotlb_sync a no-op for non-strict DMA domains, put the corresponding RPCIT flush and retry in .sync_map, then allow that to propagate an error back to iommu_map() if the new mapping still hasn't taken. Thanks, Robin. >> Other than the firmware reserved region helpers which are necessarily a >> bit pick-and-mix, I've been trying to remove all the iommu-dma details >> from drivers, so I'd really like to maintain that separation if at all >> possible. > > Hmm, tough one. 
Having a flush queue implies that we're holding on to > IOVAs that we could free and this is kind of directly architected into > our IOTLB flush with this "free some IOVAs and try again" error return. > >> >>> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> >>> --- >>> drivers/iommu/dma-iommu.c | 14 +++++++++----- >>> drivers/iommu/dma-iommu.h | 1 + >>> drivers/iommu/s390-iommu.c | 7 +++++-- >>> 3 files changed, 15 insertions(+), 7 deletions(-) >>> >>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c >>> index 3801cdf11aa8..54e7f63fd0d9 100644 >>> --- a/drivers/iommu/dma-iommu.c >>> +++ b/drivers/iommu/dma-iommu.c >>> @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie) >>> spin_unlock_irqrestore(&fq->lock, flags); >>> } >>> >>> -static void fq_flush_timeout(struct timer_list *t) >>> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie) >>> { >>> - struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); >>> - >>> - atomic_set(&cookie->fq_timer_on, 0); >>> fq_flush_iotlb(cookie); >>> - >>> if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ) >>> fq_flush_percpu(cookie); >>> else >>> fq_flush_single(cookie); >>> } >>> >>> +static void fq_flush_timeout(struct timer_list *t) >>> +{ >>> + struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer); >>> + >>> + atomic_set(&cookie->fq_timer_on, 0); >>> + iommu_dma_flush_fq(cookie); >>> +} >>> + >>> static void queue_iova(struct iommu_dma_cookie *cookie, >>> unsigned long pfn, unsigned long pages, >>> struct list_head *freelist) >>> diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h >>> index 942790009292..cac06030aa26 100644 >>> --- a/drivers/iommu/dma-iommu.h >>> +++ b/drivers/iommu/dma-iommu.h >>> @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain); >>> void iommu_put_dma_cookie(struct iommu_domain *domain); >>> >>> int iommu_dma_init_fq(struct iommu_domain *domain); >>> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie); >>> >>> void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); >>> >>> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c >>> index 087bb2acff30..9c2782c4043e 100644 >>> --- a/drivers/iommu/s390-iommu.c >>> +++ b/drivers/iommu/s390-iommu.c >>> @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain, >>> { >>> struct s390_domain *s390_domain = to_s390_domain(domain); >>> struct zpci_dev *zdev; >>> + int rc; >>> >>> rcu_read_lock(); >>> list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) { >>> if (!zdev->tlb_refresh) >>> continue; >>> atomic64_inc(&s390_domain->ctrs.sync_map_rpcits); >>> - zpci_refresh_trans((u64)zdev->fh << 32, >>> - iova, size); >>> + rc = zpci_refresh_trans((u64)zdev->fh << 32, >>> + iova, size); >>> + if (rc == -ENOMEM) >>> + iommu_dma_flush_fq(domain->iova_cookie); >> >> Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or >> IOMMU_DOMAIN_UNMANAGED domain? > > In theory yes and then iommu_dma_flush_fq() still does the > .flush_iotlb_all to give the hypervisor a chance to look for freed > IOVAs but without flush queues you're really just running out of IOVA > space and that's futile. > > This does highlight an important missed issue though. As we don't > return the resulting error from the subsequent .flush_iotlb_all we only > find out if this didn't work once the mapping is used while in our > current DMA API implementation we correctly return DMA_MAPPING_ERROR in > this case. 
I guess this means we do need error returns from the IOTLB > helpers since in a paged guest this is where we finally find out that > our mapping couldn't be synced to the hypervisor's shadow table and I > don't really see a way around that. Also there are other error > conditions implied in this shadowing too, for example the hypervisor > could simply fail to pin guest memory and while we can't do anything > about that at least we should fail the mapping operation. > >> >> However I can't figure out how this is supposed to work anyway - >> .sync_map only gets called if .map claimed that the actual mapping(s) >> succeeded, it can't fail itself, and even if it does free up some IOVAs >> at this point by draining the flush queue, I don't see how the mapping >> then gets retried, or what happens if it still fails after that :/ >> >> Thanks, >> Robin. > > Yeah, this is a bit non obvious and you are correct in that the > architecture requires a subseqeunt IOTLB flush i.e. retry for the range > that returned the error. And your last point is then exactly the issue > above that we miss if the retry still failed. > > As for the good path, in the mapping operation but before the .sync_map > we have updated the IOMMU translation table so the translation is the > recorded but not synced to the hypervisor's shadow table. Now when the > .sync_map is called and the IOTLB flush returns -ENOMEM then > iommu_dma_flush_fq() will call .flush_iotlb_all which causes the > hypervisor to look at the entire guest translation table and shadow all > translations that were not yet shadowed. I.e. the .flush_iotlb_all > "retries" the failed .sync_map. > >> >>> } >>> rcu_read_unlock(); >>> } > >
On 11/29/22 7:00 AM, Niklas Schnelle wrote: > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: >> On 2022-11-16 17:16, Niklas Schnelle wrote: >>> When RPCIT indicates that the underlying hypervisor has run out of >>> resources it often means that its IOVA space is exhausted and IOVAs need >>> to be freed before new ones can be created. By triggering a flush of the >>> IOVA queue we can get the queued IOVAs freed and also get the new >>> mapping established during the global flush. >> >> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is >> exhausted and fail the DMA API call before even getting as far as >> iommu_map(), though? Or is there some less obvious limitation like a >> maximum total number of distinct IOVA regions regardless of size? > > Well, yes and no. Your thinking is of course correct if the advertised > available IOVA space can be fully utilized without exhausting > hypervisor resources we won't trigger this case. However sadly there > are complications. The most obvious being that in QEMU/KVM the > restriction of the IOVA space to what QEMU can actually have mapped at > once was just added recently[0] prior to that we would regularly go > through this "I'm out of resources free me some IOVAs" dance with our > existing DMA API implementation where this just triggers an early cycle > of freeing all unused IOVAs followed by a global flush. On z/VM I know > of no situations where this is triggered. That said this signalling is While the QEMU case made for an easily-reproducible scenario, the indication was really provided to handle a scenario where you have multiple pageable guests whose sum of total memory is overcommitted as compared to the hypervisor's resources. The intent is for the entire advertised guest aperture to be usable and generally-speaking it is, but it's possible to find yourself in a (poorly tuned) scenario where the hypervisor is unable to pin additional pages (basically an OOM condition) -- the hypervisor (qemu/kvm or z/VM) can use this as a cry for help to say "stop what you're doing and flush your queues immediately so I can unpin as much as possible", and then after that the guest(s) can continue using their aperture. This is unnecessary for the no-paging bare metal hypervisor because there you aren't over-committing memory.
On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote: > On 2022-11-29 12:00, Niklas Schnelle wrote: > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: > > > On 2022-11-16 17:16, Niklas Schnelle wrote: > > > > When RPCIT indicates that the underlying hypervisor has run out of > > > > resources it often means that its IOVA space is exhausted and IOVAs need > > > > to be freed before new ones can be created. By triggering a flush of the > > > > IOVA queue we can get the queued IOVAs freed and also get the new > > > > mapping established during the global flush. > > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is > > > exhausted and fail the DMA API call before even getting as far as > > > iommu_map(), though? Or is there some less obvious limitation like a > > > maximum total number of distinct IOVA regions regardless of size? > > > > Well, yes and no. Your thinking is of course correct if the advertised > > available IOVA space can be fully utilized without exhausting > > hypervisor resources we won't trigger this case. However sadly there > > are complications. The most obvious being that in QEMU/KVM the > > restriction of the IOVA space to what QEMU can actually have mapped at > > once was just added recently[0] prior to that we would regularly go > > through this "I'm out of resources free me some IOVAs" dance with our > > existing DMA API implementation where this just triggers an early cycle > > of freeing all unused IOVAs followed by a global flush. On z/VM I know > > of no situations where this is triggered. That said this signalling is > > architected so z/VM may have corner cases where it does this. On our > > bare metal hypervisor (no paging) this return code is unused and IOTLB > > flushes are simply hardware cache flushes as on bare metal platforms. > > > > [0] > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ > > That sheds a bit more light, thanks, although I'm still not confident I > fully understand the whole setup. AFAICS that patch looks to me like > it's putting a fixed limit on the size of the usable address space. That > in turn implies that "free some IOVAs and try again" might be a red > herring and never going to work; for your current implementation, what > that presumably means in reality is "free some IOVAs, resetting the > allocator to start allocating lower down in the address space where it > will happen to be below that limit, and try again", but the iommu-dma > allocator won't do that. If it doesn't know that some arbitrary range > below the top of the driver-advertised aperture is unusable, it will > just keep allocating IOVAs up there and mappings will always fail. > > If the driver can't accurately represent the usable IOVA space via the > aperture and/or reserved regions, then this whole approach seems doomed. > If on the other hand I've misunderstood and you can actually still use > any address, just not all of them at the same time, This is exactly it, the problem is a limit on the number of IOVAs that are concurrently mapped. In QEMU pass-through the tightest limit is usually the one set by the host kernel parameter vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With IOMMU_DOMAIN_DMA we stay under this limit without extra action but once there is a flush queue (including the existing per-CPU one) where each entry may keep many pages lazily unmapped this is easly hit with fio bandwidth tests on an NVMe. 
For this case this patch works reliably because of course the number of actually active mappings without the lazily freed ones is similar to the number of active ones with IOMMU_DOMAIN_DMA. > then it might in > fact be considerably easier to skip the flush queue mechanism entirely > and implement this internally to the driver - basically make .iotlb_sync > a no-op for non-strict DMA domains, I'm assuming you mean .iotlb_sync_map above. > put the corresponding RPCIT flush > and retry in .sync_map, then allow that to propagate an error back to > iommu_map() if the new mapping still hasn't taken. > > Thanks, > Robin. Hmm, interesting. This would leave the IOVAs in the flush queue lazily unmapped and thus still block their re-use but free their host resources via a global RPCIT, allowing the guest to use a different portion of the IOVA space with those resources. It could work, though I need to test it, but it feels a bit clunky. Maybe we can go cleaner while using this idea of not having to flush the queue but just freeing their host-side resources. If we allowed .iotlb_sync_map to return an error that fails the mapping operation, then we could do it all in there. In the normal case it just does the RPCIT but if that returns that the hypervisor ran out of resources it does another global RPCIT allowing the hypervisor to free IOVAs that were lazily unmapped. If the latter succeeds all is good; if not, the mapping operation failed. Logically it makes sense too, .iotlb_sync_map is the final step of syncing the mapping to the host which can fail just like the mapping operation itself. Apart from the out of active IOVAs case this would also handle the other useful error case when using .iotlb_sync_map for shadowing where it fails because the host ran against a pinned pages limit or out of actual memory. Not by fixing it but at least we would get a failed mapping operation. The other callbacks .flush_iotlb_all and .iotlb_sync could stay the same as they are only used for unmapped pages where we can't reasonably run out of resources in the host, neither active IOVAs nor pinned pages.
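As a rough illustration of that idea, a sketch only: it assumes the .iotlb_sync_map callback is changed to return int, and the global refresh over zdev->start_dma..zdev->end_dma stands in for whatever the "global RPCIT" ends up looking like:

static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
                                     unsigned long iova, size_t size)
{
        struct s390_domain *s390_domain = to_s390_domain(domain);
        struct zpci_dev *zdev;
        int ret = 0;
        int rc;

        rcu_read_lock();
        list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
                if (!zdev->tlb_refresh)
                        continue;
                rc = zpci_refresh_trans((u64)zdev->fh << 32, iova, size);
                if (rc == -ENOMEM) {
                        /* Out of resources: a global refresh lets the
                         * hypervisor free shadow entries for lazily unmapped
                         * IOVAs and also shadows the not-yet-synced range. */
                        rc = zpci_refresh_trans((u64)zdev->fh << 32,
                                                zdev->start_dma,
                                                zdev->end_dma - zdev->start_dma + 1);
                }
                if (rc)
                        /* still failing: let the mapping operation fail */
                        ret = rc;
        }
        rcu_read_unlock();
        return ret;
}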
On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote: > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote: > > On 2022-11-29 12:00, Niklas Schnelle wrote: > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: > > > > On 2022-11-16 17:16, Niklas Schnelle wrote: > > > > > When RPCIT indicates that the underlying hypervisor has run out of > > > > > resources it often means that its IOVA space is exhausted and IOVAs need > > > > > to be freed before new ones can be created. By triggering a flush of the > > > > > IOVA queue we can get the queued IOVAs freed and also get the new > > > > > mapping established during the global flush. > > > > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is > > > > exhausted and fail the DMA API call before even getting as far as > > > > iommu_map(), though? Or is there some less obvious limitation like a > > > > maximum total number of distinct IOVA regions regardless of size? > > > > > > Well, yes and no. Your thinking is of course correct if the advertised > > > available IOVA space can be fully utilized without exhausting > > > hypervisor resources we won't trigger this case. However sadly there > > > are complications. The most obvious being that in QEMU/KVM the > > > restriction of the IOVA space to what QEMU can actually have mapped at > > > once was just added recently[0] prior to that we would regularly go > > > through this "I'm out of resources free me some IOVAs" dance with our > > > existing DMA API implementation where this just triggers an early cycle > > > of freeing all unused IOVAs followed by a global flush. On z/VM I know > > > of no situations where this is triggered. That said this signalling is > > > architected so z/VM may have corner cases where it does this. On our > > > bare metal hypervisor (no paging) this return code is unused and IOTLB > > > flushes are simply hardware cache flushes as on bare metal platforms. > > > > > > [0] > > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ > > > > That sheds a bit more light, thanks, although I'm still not confident I > > fully understand the whole setup. AFAICS that patch looks to me like > > it's putting a fixed limit on the size of the usable address space. That > > in turn implies that "free some IOVAs and try again" might be a red > > herring and never going to work; for your current implementation, what > > that presumably means in reality is "free some IOVAs, resetting the > > allocator to start allocating lower down in the address space where it > > will happen to be below that limit, and try again", but the iommu-dma > > allocator won't do that. If it doesn't know that some arbitrary range > > below the top of the driver-advertised aperture is unusable, it will > > just keep allocating IOVAs up there and mappings will always fail. > > > > If the driver can't accurately represent the usable IOVA space via the > > aperture and/or reserved regions, then this whole approach seems doomed. > > If on the other hand I've misunderstood and you can actually still use > > any address, just not all of them at the same time, > > > This is exactly it, the problem is a limit on the number of IOVAs that > are concurrently mapped. In QEMU pass-through the tightest limit is > usually the one set by the host kernel parameter > vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. 
With > IOMMU_DOMAIN_DMA we stay under this limit without extra action but once > there is a flush queue (including the existing per-CPU one) where each > entry may keep many pages lazily unmapped this is easily hit with fio > bandwidth tests on an NVMe. For this case this patch works reliably > because of course the number of actually active mappings without the > lazily freed ones is similar to the number of active ones with > IOMMU_DOMAIN_DMA. > > > then it might in > > fact be considerably easier to skip the flush queue mechanism entirely > > and implement this internally to the driver - basically make .iotlb_sync > > a no-op for non-strict DMA domains, > > I'm assuming you mean .iotlb_sync_map above. > > > put the corresponding RPCIT flush > > and retry in .sync_map, then allow that to propagate an error back to > > iommu_map() if the new mapping still hasn't taken. > > > > Thanks, > > Robin. > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily > unmapped and thus still block their re-use but free their host > resources via a global RPCIT allowing the guest to use a different > portion of the IOVA space with those resources. It could work, though > I need to test it, but it feels a bit clunky. > > Maybe we can go cleaner while using this idea of not having to flush > the queue but just freeing their host-side resources. If we allowed > .iotlb_sync_map to return an error that fails the mapping operation, > then we could do it all in there. In the normal case it just does the > RPCIT but if that returns that the hypervisor ran out of resources it > does another global RPCIT allowing the hypervisor to free IOVAs that > were lazily unmapped. If the latter succeeds all is good if not then > the mapping operation failed. Logically it makes sense too, > .iotlb_sync_map is the final step of syncing the mapping to the host > which can fail just like the mapping operation itself. > > Apart from the out of active IOVAs case this would also handle the > other useful error case when using .iotlb_sync_map for shadowing where > it fails because the host ran against a pinned pages limit or out of > actual memory. Not by fixing it but at least we would get a failed > mapping operation. > > The other callbacks .flush_iotlb_all and .iotlb_sync > could stay the same as they are only used for unmapped pages where we > can't reasonably run out of resources in the host neither active IOVAs > nor pinned pages. > Ok, I've done some testing with the above idea and this seems to work great. I've verified that my version of QEMU (without Matt's IOVA aperture restriction patch) creates the RPCIT out-of-resource indications and then the global flush in .iotlb_sync_map is triggered and allows QEMU to unpin pages and free IOVAs while the guest still has them lazily unmapped (sitting in the flush queue) and thus uses different IOVAs. @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it can return an error and then failing the mapping operation I think this is a great approach. One advantage over the previous approach of flushing the queue is that this should work for the pure IOMMU API too. If you don't want to change the signature of .iotlb_sync_map I think we can do Robin's idea and have .iotlb_sync_map as a no-op and do the RPCIT sync as part of the s390_iommu_map_pages(). This would hide what really is our variant of .iotlb_sync_map in the mapping code though, which I don't super like.
Besides that it would also cause more RPCITs in __iommu_map_sg() as we could no longer use a single RPCIT for the entire range. Thanks, Niklas
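For reference, the single RPCIT per scatterlist works today because the core maps all elements first and only then calls .iotlb_sync_map once over the whole range; roughly like the following sketch (simplified and renamed, not the actual iommu.c code):

static ssize_t iommu_map_sg_sketch(struct iommu_domain *domain,
                                   unsigned long iova, struct scatterlist *sg,
                                   unsigned int nents, int prot, gfp_t gfp)
{
        const struct iommu_domain_ops *ops = domain->ops;
        struct scatterlist *s;
        size_t mapped = 0;
        unsigned int i;
        int ret;

        for_each_sg(sg, s, nents, i) {
                /* map each element, no IOTLB sync yet */
                ret = __iommu_map(domain, iova + mapped, sg_phys(s),
                                  s->length, prot, gfp);
                if (ret)
                        return ret;
                mapped += s->length;
        }

        /* one .iotlb_sync_map call, i.e. a single RPCIT on s390,
         * covering the whole just-mapped IOVA range */
        if (ops->iotlb_sync_map)
                ops->iotlb_sync_map(domain, iova, mapped);

        return mapped;
}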
On Fri, Dec 02, 2022 at 03:29:50PM +0100, Niklas Schnelle wrote: > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it > can return an error and then failing the mapping operation I think > this is a great approach. One advantage over the previous approach of > flushing the queue is that this should work for the pure IOMMU API too. I think it is pretty important that the "pure" IOMMU API work with whatever your solution is; the iommu_domain implementation in a driver should not be sensitive to whether dma-iommu.c is operating the domain or something else. Jason
On Fri, 2022-12-02 at 10:42 -0400, Jason Gunthorpe wrote: > On Fri, Dec 02, 2022 at 03:29:50PM +0100, Niklas Schnelle wrote: > > > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it > > can return an error and then failing the mapping operation I think > > this is a great approach. One advantage over the previous approach of > > flushing the queue is that this should work for the pure IOMMU API too. > > I think it is pretty important that the "pure" IOMMU API work with > whatever your solution is; the iommu_domain implementation in a > driver should not be sensitive to whether dma-iommu.c is operating the > domain or something else. > > Jason Agree. At the moment, i.e. with current mainline and even with the IOMMU improvements in Joerg's queue, the IOMMU API doesn't work when the hypervisor returns out-of-resource from the IOTLB flush (RPCIT). This is currently only handled correctly in the arch/s390/pci/pci_dma.c DMA API implementation. I think either handling this with a hidden/inlined .iotlb_sync_map in s390_iommu_map_pages() or with an enhanced .iotlb_sync_map that is able to return an error will fix this shortcoming. The latter would be something like this commit I'm currently testing with: https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/commit/?h=dma_iommu_v3&id=a2aecd821fe3c5e2549946a68d8b07e05b288a9b Thanks, Niklas
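Without relying on the linked commit's details, the core-side change being described presumably has roughly this shape (an illustrative sketch; the function name and error handling are assumptions):

static int iommu_map_and_sync(struct iommu_domain *domain, unsigned long iova,
                              phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
        const struct iommu_domain_ops *ops = domain->ops;
        int ret;

        ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
        if (ret)
                return ret;

        if (ops->iotlb_sync_map) {
                /* assumes .iotlb_sync_map now returns int instead of void */
                ret = ops->iotlb_sync_map(domain, iova, size);
                if (ret) {
                        /* shadowing failed even after the driver's global
                         * retry: undo the page-table update and report it */
                        iommu_unmap(domain, iova, size);
                        return ret;
                }
        }
        return 0;
}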
On Fri, Dec 02, 2022 at 04:12:50PM +0100, Niklas Schnelle wrote:
> https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/commit/?h=dma_iommu_v3&id=a2aecd821fe3c5e2549946a68d8b07e05b288a9b
This patch makes sense to me: if the sync-map is optimally a
hypervisor call and the hypervisor is allowed to fail it, then
propagating the failure seems necessary.
Jason
On 2022-12-02 14:29, Niklas Schnelle wrote: > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote: >> On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote: >>> On 2022-11-29 12:00, Niklas Schnelle wrote: >>>> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: >>>>> On 2022-11-16 17:16, Niklas Schnelle wrote: >>>>>> When RPCIT indicates that the underlying hypervisor has run out of >>>>>> resources it often means that its IOVA space is exhausted and IOVAs need >>>>>> to be freed before new ones can be created. By triggering a flush of the >>>>>> IOVA queue we can get the queued IOVAs freed and also get the new >>>>>> mapping established during the global flush. >>>>> >>>>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is >>>>> exhausted and fail the DMA API call before even getting as far as >>>>> iommu_map(), though? Or is there some less obvious limitation like a >>>>> maximum total number of distinct IOVA regions regardless of size? >>>> >>>> Well, yes and no. Your thinking is of course correct if the advertised >>>> available IOVA space can be fully utilized without exhausting >>>> hypervisor resources we won't trigger this case. However sadly there >>>> are complications. The most obvious being that in QEMU/KVM the >>>> restriction of the IOVA space to what QEMU can actually have mapped at >>>> once was just added recently[0] prior to that we would regularly go >>>> through this "I'm out of resources free me some IOVAs" dance with our >>>> existing DMA API implementation where this just triggers an early cycle >>>> of freeing all unused IOVAs followed by a global flush. On z/VM I know >>>> of no situations where this is triggered. That said this signalling is >>>> architected so z/VM may have corner cases where it does this. On our >>>> bare metal hypervisor (no paging) this return code is unused and IOTLB >>>> flushes are simply hardware cache flushes as on bare metal platforms. >>>> >>>> [0] >>>> https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ >>> >>> That sheds a bit more light, thanks, although I'm still not confident I >>> fully understand the whole setup. AFAICS that patch looks to me like >>> it's putting a fixed limit on the size of the usable address space. That >>> in turn implies that "free some IOVAs and try again" might be a red >>> herring and never going to work; for your current implementation, what >>> that presumably means in reality is "free some IOVAs, resetting the >>> allocator to start allocating lower down in the address space where it >>> will happen to be below that limit, and try again", but the iommu-dma >>> allocator won't do that. If it doesn't know that some arbitrary range >>> below the top of the driver-advertised aperture is unusable, it will >>> just keep allocating IOVAs up there and mappings will always fail. >>> >>> If the driver can't accurately represent the usable IOVA space via the >>> aperture and/or reserved regions, then this whole approach seems doomed. >>> If on the other hand I've misunderstood and you can actually still use >>> any address, just not all of them at the same time, >> >> >> This is exactly it, the problem is a limit on the number of IOVAs that >> are concurrently mapped. In QEMU pass-through the tightest limit is >> usually the one set by the host kernel parameter >> vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. 
With >> IOMMU_DOMAIN_DMA we stay under this limit without extra action but once >> there is a flush queue (including the existing per-CPU one) where each >> entry may keep many pages lazily unmapped this is easly hit with fio >> bandwidth tests on an NVMe. For this case this patch works reliably >> because of course the number of actually active mappings without the >> lazily freed ones is similar to the number of active ones with >> IOMMU_DOMAIN_DMA. >> >>> then it might in >>> fact be considerably easier to skip the flush queue mechanism entirely >>> and implement this internally to the driver - basically make .iotlb_sync >>> a no-op for non-strict DMA domains, >> >> I'm assuming you mean .iotlb_sync_map above. No, I did mean .iotlb_sync, however on reflection that was under the assumption that it's OK for the hypervisor to see a new mapping for a previously-used IOVA without having seen it explicitly unmapped in between. Fair enough if that isn't the case, but if it is then your pagetable can essentially act as the "flush queue" by itself. >>> put the corresponding RPCIT flush >>> and retry in .sync_map, then allow that to propagate an error back to >>> iommu_map() if the new mapping still hasn't taken. >>> >>> Thanks, >>> Robin. >> >> Hmm, interesting. This would leave the IOVAs in the flush queue lazily >> unmapped and thus still block their re-use but free their host >> resources via a global RPCIT allowing the guest to use a different >> porition of the IOVA space with those resources. It could work, though >> I need to test it, but it feels a bit clunky. >> >> Maybe we can go cleaner while using this idea of not having to flush >> the queue but just freeing their host side resources. If we allowed >> .iotlb_sync_map to return an error that fails the mapping operation, >> then we could do it all in there. In the normal case it just does the >> RPCIT but if that returns that the hypervisor ran out of resources it >> does another global RPCIT allowing the hypervisor to free IOVAs that >> were lazily unmapped. If the latter succeeds all is good if not then >> the mapping operation failed. Logically it makes sense too, >> .iotlb_sync_map is the final step of syncing the mapping to the host >> which can fail just like the mapping operation itself. >> >> Apart from the out of active IOVAs case this would also handle the >> other useful error case when using .iotlb_sync_map for shadowing where >> it fails because the host ran against a pinned pages limit or out of >> actual memory. Not by fixing it but at least we would get a failed >> mapping operation. >> >> The other callbacks .flush_iotlb_all and .iotlb_sync >> could stay the same as they are only used for unmapped pages where we >> can't reasonably run out of resources in the host neither active IOVAs >> nor pinned pages. >> > > Ok, I've done some testing with the above idea and this seems to work > great. I've verified that my version of QEMU (without Matt's IOVA > aperture resrtriction patch) creates the RPCIT out of resource > indications and then the global flush in .iotlb_sync_map is triggered > and allows QEMU to unpin pages and free IOVAs while the guest still has > them lazily unpapeg (sitting in the flush queue) and thus uses > different IOVAs. > > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it > can return and error and then failing the mapping operation I think > this is a great approach. 
One advantage over the previous approach of > flushing the queue isthat this should work for the pure IOMMU API too. Whatever happens I think allowing .iotlb_sync_map to propagate an error out through iommu_map() is an appropriate thing to do - it sounds like s390 might technically need that for regular IOMMU API correctness in some circumstances anyway. Besides, even in the cases where it represents "simple" TLB maintenance, there are potentially ways that could fail (like timing out if the IOMMU has gone completely wrong), so it wouldn't seem entirely unreasonable if a driver might want to report overall failure if it can't guarantee that the new mapping will actually be usable. Thanks, Robin. > If you don't want to change the signature of .iotlb_sync_map I think we > can do Robin's idea and have .iotlb_sync_map as a no-op and do the > RPCIT sync as part of the s390_iommu_map_pages(). This would hide what > really is our variant of .iotlb_sync_map in the mapping code though > which I don't super like. Besides that it would also cause more RPCITs > in __iommu_map_sg() as we could no longer use a single RPCIT for the > entire range. > > Thanks, > Niklas
On Mon, 2022-12-05 at 18:24 +0000, Robin Murphy wrote: > On 2022-12-02 14:29, Niklas Schnelle wrote: > > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote: > > > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote: > > > > On 2022-11-29 12:00, Niklas Schnelle wrote: > > > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote: > > > > > > On 2022-11-16 17:16, Niklas Schnelle wrote: > > > > > > > When RPCIT indicates that the underlying hypervisor has run out of > > > > > > > resources it often means that its IOVA space is exhausted and IOVAs need > > > > > > > to be freed before new ones can be created. By triggering a flush of the > > > > > > > IOVA queue we can get the queued IOVAs freed and also get the new > > > > > > > mapping established during the global flush. > > > > > > > > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is > > > > > > exhausted and fail the DMA API call before even getting as far as > > > > > > iommu_map(), though? Or is there some less obvious limitation like a > > > > > > maximum total number of distinct IOVA regions regardless of size? > > > > > > > > > > Well, yes and no. Your thinking is of course correct if the advertised > > > > > available IOVA space can be fully utilized without exhausting > > > > > hypervisor resources we won't trigger this case. However sadly there > > > > > are complications. The most obvious being that in QEMU/KVM the > > > > > restriction of the IOVA space to what QEMU can actually have mapped at > > > > > once was just added recently[0] prior to that we would regularly go > > > > > through this "I'm out of resources free me some IOVAs" dance with our > > > > > existing DMA API implementation where this just triggers an early cycle > > > > > of freeing all unused IOVAs followed by a global flush. On z/VM I know > > > > > of no situations where this is triggered. That said this signalling is > > > > > architected so z/VM may have corner cases where it does this. On our > > > > > bare metal hypervisor (no paging) this return code is unused and IOTLB > > > > > flushes are simply hardware cache flushes as on bare metal platforms. > > > > > > > > > > [0] > > > > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/ > > > > > > > > That sheds a bit more light, thanks, although I'm still not confident I > > > > fully understand the whole setup. AFAICS that patch looks to me like > > > > it's putting a fixed limit on the size of the usable address space. That > > > > in turn implies that "free some IOVAs and try again" might be a red > > > > herring and never going to work; for your current implementation, what > > > > that presumably means in reality is "free some IOVAs, resetting the > > > > allocator to start allocating lower down in the address space where it > > > > will happen to be below that limit, and try again", but the iommu-dma > > > > allocator won't do that. If it doesn't know that some arbitrary range > > > > below the top of the driver-advertised aperture is unusable, it will > > > > just keep allocating IOVAs up there and mappings will always fail. > > > > > > > > If the driver can't accurately represent the usable IOVA space via the > > > > aperture and/or reserved regions, then this whole approach seems doomed. 
> > > > If on the other hand I've misunderstood and you can actually still use > > > > any address, just not all of them at the same time, > > > > > > > > > This is exactly it, the problem is a limit on the number of IOVAs that > > > are concurrently mapped. In QEMU pass-through the tightest limit is > > > usually the one set by the host kernel parameter > > > vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With > > > IOMMU_DOMAIN_DMA we stay under this limit without extra action but once > > > there is a flush queue (including the existing per-CPU one) where each > > > entry may keep many pages lazily unmapped this is easly hit with fio > > > bandwidth tests on an NVMe. For this case this patch works reliably > > > because of course the number of actually active mappings without the > > > lazily freed ones is similar to the number of active ones with > > > IOMMU_DOMAIN_DMA. > > > > > > > then it might in > > > > fact be considerably easier to skip the flush queue mechanism entirely > > > > and implement this internally to the driver - basically make .iotlb_sync > > > > a no-op for non-strict DMA domains, > > > > > > I'm assuming you mean .iotlb_sync_map above. > > No, I did mean .iotlb_sync, however on reflection that was under the > assumption that it's OK for the hypervisor to see a new mapping for a > previously-used IOVA without having seen it explicitly unmapped in > between. Fair enough if that isn't the case, but if it is then your > pagetable can essentially act as the "flush queue" by itself. Hmm, I see. We do want the hypervisor to see mappings as invalid before they are changed again to a new valid mapping. In case of e.g. QEMU/KVM this allows the hypervisor to itself rely on the hardware to fill in the IOTLB on map i.e. use a no-op .iotlb_sync_map and a .iotlb_sync that flushes the hardware IOTLB. Also this allows QEMU to emulate the IOMMU with just map/unmap primitives on vfio/iommufd. Consequently, I recently found that vfio-pci pass-through works in combination with an emulated s390 guest on an x86_64 host albeit very slowly of course. > > > > > put the corresponding RPCIT flush > > > > and retry in .sync_map, then allow that to propagate an error back to > > > > iommu_map() if the new mapping still hasn't taken. > > > > > > > > Thanks, > > > > Robin. > > > > > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily > > > unmapped and thus still block their re-use but free their host > > > resources via a global RPCIT allowing the guest to use a different > > > porition of the IOVA space with those resources. It could work, though > > > I need to test it, but it feels a bit clunky. > > > > > > Maybe we can go cleaner while using this idea of not having to flush > > > the queue but just freeing their host side resources. If we allowed > > > .iotlb_sync_map to return an error that fails the mapping operation, > > > then we could do it all in there. In the normal case it just does the > > > RPCIT but if that returns that the hypervisor ran out of resources it > > > does another global RPCIT allowing the hypervisor to free IOVAs that > > > were lazily unmapped. If the latter succeeds all is good if not then > > > the mapping operation failed. Logically it makes sense too, > > > .iotlb_sync_map is the final step of syncing the mapping to the host > > > which can fail just like the mapping operation itself. 
> > >
> > > Apart from the out of active IOVAs case this would also handle the
> > > other useful error case when using .iotlb_sync_map for shadowing, where
> > > it fails because the host ran against a pinned pages limit or out of
> > > actual memory. Not by fixing it, but at least we would get a failed
> > > mapping operation.
> > >
> > > The other callbacks .flush_iotlb_all and .iotlb_sync could stay the
> > > same, as they are only used for unmapped pages where we can't
> > > reasonably run out of resources in the host, neither active IOVAs nor
> > > pinned pages.
> >
> > Ok, I've done some testing with the above idea and this seems to work
> > great. I've verified that my version of QEMU (without Matt's IOVA
> > aperture restriction patch) creates the RPCIT out-of-resource
> > indications, and that the global flush in .iotlb_sync_map is then
> > triggered and allows QEMU to unpin pages and free IOVAs while the guest
> > still has them lazily unmapped (sitting in the flush queue) and thus
> > uses different IOVAs.
> >
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> > can return an error and then failing the mapping operation, I think
> > this is a great approach. One advantage over the previous approach of
> > flushing the queue is that this should work for the pure IOMMU API too.
>
> Whatever happens I think allowing .iotlb_sync_map to propagate an error
> out through iommu_map() is an appropriate thing to do - it sounds like
> s390 might technically need that for regular IOMMU API correctness in
> some circumstances anyway. Besides, even in the cases where it
> represents "simple" TLB maintenance, there are potentially ways that
> could fail (like timing out if the IOMMU has gone completely wrong), so
> it wouldn't seem entirely unreasonable if a driver might want to report
> overall failure if it can't guarantee that the new mapping will actually
> be usable.
>
> Thanks,
> Robin.

Great, then I'll use this approach for v3, or maybe even send this
separately as a prerequisite IOMMU fix, since this is also required for
the IOMMU API to work in a guest where the hypervisor may report running
out of resources.
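For reference, the generic change agreed on above - letting .iotlb_sync_map
return an error that iommu_map() propagates - might look roughly like the
sketch below. This is only an illustration against the 6.1-era layout of
drivers/iommu/iommu.c; the exact call site and the decision to unmap again
on a sync failure are assumptions, not something settled in this thread.
The interim diff that was actually tested follows the sketch.

/*
 * Sketch only: include/linux/iommu.h would change the callback to
 * return int:
 *
 *	int (*iotlb_sync_map)(struct iommu_domain *domain,
 *			      unsigned long iova, size_t size);
 *
 * and the assumed _iommu_map() call site would propagate the error.
 */
static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
		      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
	const struct iommu_domain_ops *ops = domain->ops;
	int ret;

	ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
	if (ret)
		return ret;

	if (ops->iotlb_sync_map) {
		ret = ops->iotlb_sync_map(domain, iova, size);
		if (ret)
			/* assumption: tear the mapping down again so the
			 * caller sees a clean failure */
			iommu_unmap(domain, iova, size);
	}

	return ret;
}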
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 3801cdf11aa8..54e7f63fd0d9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie)
 	spin_unlock_irqrestore(&fq->lock, flags);
 }
 
-static void fq_flush_timeout(struct timer_list *t)
+void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie)
 {
-	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
-
-	atomic_set(&cookie->fq_timer_on, 0);
 	fq_flush_iotlb(cookie);
-
 	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
 		fq_flush_percpu(cookie);
 	else
 		fq_flush_single(cookie);
 }
 
+static void fq_flush_timeout(struct timer_list *t)
+{
+	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
+
+	atomic_set(&cookie->fq_timer_on, 0);
+	iommu_dma_flush_fq(cookie);
+}
+
 static void queue_iova(struct iommu_dma_cookie *cookie,
 		unsigned long pfn, unsigned long pages,
 		struct list_head *freelist)
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index 942790009292..cac06030aa26 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -13,6 +13,7 @@
 int iommu_get_dma_cookie(struct iommu_domain *domain);
 void iommu_put_dma_cookie(struct iommu_domain *domain);
 
 int iommu_dma_init_fq(struct iommu_domain *domain);
+void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie);
 
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 087bb2acff30..9c2782c4043e 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
+	int rc;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		if (!zdev->tlb_refresh)
 			continue;
 		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
-		zpci_refresh_trans((u64)zdev->fh << 32,
-				   iova, size);
+		rc = zpci_refresh_trans((u64)zdev->fh << 32,
+					iova, size);
+		if (rc == -ENOMEM)
+			iommu_dma_flush_fq(domain->iova_cookie);
 	}
 	rcu_read_unlock();
 }
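The diff above is the interim, flush-queue-based variant that was tested. A
rough sketch of the error-propagating driver side planned for v3 might
instead look like the following, replacing the iommu_dma_flush_fq() call
with a second, aperture-wide RPCIT on -ENOMEM. The zpci_refresh_all() helper
is assumed here purely for illustration (built on the existing
zpci_refresh_trans() and the zdev start_dma/end_dma aperture fields); it is
not part of the patch above.

/* Sketch only: error-propagating variant discussed for v3. */
static int zpci_refresh_all(struct zpci_dev *zdev)
{
	/* refresh the whole aperture so the hypervisor sees every lazily
	 * invalidated entry and can free IOVAs / unpin pages */
	return zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
				  zdev->end_dma - zdev->start_dma + 1);
}

static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
				     unsigned long iova, size_t size)
{
	struct s390_domain *s390_domain = to_s390_domain(domain);
	struct zpci_dev *zdev;
	int rc = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
		if (!zdev->tlb_refresh)
			continue;
		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
		rc = zpci_refresh_trans((u64)zdev->fh << 32, iova, size);
		/* out of resources: one global refresh lets the hypervisor
		 * free lazily unmapped IOVAs and also covers the new range */
		if (rc == -ENOMEM)
			rc = zpci_refresh_all(zdev);
		if (rc)
			break;
	}
	rcu_read_unlock();

	return rc;
}

With the core change sketched earlier, a non-zero return here would fail the
iommu_map() call instead of silently leaving a mapping the hypervisor never
saw.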