Message ID | 20230102093452.761185-2-schnelle@linux.ibm.com |
---|---|
State | New |
Headers |
From: Niklas Schnelle <schnelle@linux.ibm.com> To: Alex Williamson <alex.williamson@redhat.com>, Cornelia Huck <cohuck@redhat.com> Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org, Matthew Rosato <mjrosato@linux.ibm.com>, Pierre Morel <pmorel@linux.ibm.com>, Christian Bornträger <borntraeger@linux.ibm.com> Subject: [PATCH 1/1] vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp() Date: Mon, 2 Jan 2023 10:34:52 +0100 Message-Id: <20230102093452.761185-2-schnelle@linux.ibm.com> In-Reply-To: <20230102093452.761185-1-schnelle@linux.ibm.com> References: <20230102093452.761185-1-schnelle@linux.ibm.com> |
Series |
vfio/type1: Fix vfio-pci pass-through of ISM devices
|
|
Commit Message
Niklas Schnelle
Jan. 2, 2023, 9:34 a.m. UTC
Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain
aperture shrinking") the s390 IOMMU driver uses a reserved region
instead of an artificially shrunk aperture to restrict IOMMU use based
on the system provided DMA ranges of devices. In particular on current
machines this prevents use of DMA addresses below 2^32 for all devices.
Merely creating an IOMMU mapping below these addresses is usually
harmless. However, our virtual ISM PCI device looks at new mappings on
IOTLB flush and immediately goes into the error state if such a mapping
violates its allowed DMA ranges. This then breaks pass-through of the
ISM device to a KVM guest.
Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA
address 0 irrespective of the IOMMU's reserved regions. Even if usually
harmless this seems wrong in the general case so instead go through the
freshly updated IOVA list and try to find a range that isn't reserved
and fits 2 pages and use that for testing for fine grained super pages.
Fixes: 6fe1010d6d9c ("vfio/type1: DMA unmap chunking")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
drivers/vfio/vfio_iommu_type1.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
Comments
On Mon, Jan 02, 2023 at 10:34:52AM +0100, Niklas Schnelle wrote:
> Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain
> aperture shrinking") the s390 IOMMU driver uses a reserved region
> instead of an artificially shrunk aperture to restrict IOMMU use based
> on the system provided DMA ranges of devices. In particular on current
> machines this prevents use of DMA addresses below 2^32 for all devices.
> While usually just IOMMU mapping below these addresses is
> harmless. However our virtual ISM PCI device looks at new mappings on
> IOTLB flush and immediately goes into the error state if such a mapping
> violates its allowed DMA ranges. This then breaks pass-through of the
> ISM device to a KVM guest.
>
> Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA
> address 0 irrespective of the IOMMUs reserved regions. Even if usually
> harmless this seems wrong in the general case so instead go through the
> freshly updated IOVA list and try to find a range that isn't reserved
> and fits 2 pages and use that for testing for fine grained super pages.

Why does it matter? The s390 driver will not set fgsp=true, so if it
fails because map fails or does a proper detection it shouldn't make a
difference.

IOW how does this actually manifest into a failure?

> -	if (!ret) {
> -		size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);
> +	list_for_each_entry(region, regions, list) {
> +		if (region->end - region->start < PAGE_SIZE * 2)
> +			continue;
>
> -		if (unmapped == PAGE_SIZE)
> -			iommu_unmap(domain->domain, PAGE_SIZE, PAGE_SIZE);
> -		else
> -			domain->fgsp = true;
> +		ret = iommu_map(domain->domain, region->start, page_to_phys(pages), PAGE_SIZE * 2,
> +				IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);

The region also needs to have 'region->start % (PAGE_SIZE*2) == 0' for the
test to work

Jason
On Tue, 2023-01-03 at 19:39 -0400, Jason Gunthorpe wrote:
> On Mon, Jan 02, 2023 at 10:34:52AM +0100, Niklas Schnelle wrote:
> > Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain
> > aperture shrinking") the s390 IOMMU driver uses a reserved region
> > instead of an artificially shrunk aperture to restrict IOMMU use based
> > on the system provided DMA ranges of devices. In particular on current
> > machines this prevents use of DMA addresses below 2^32 for all devices.
> > While usually just IOMMU mapping below these addresses is
> > harmless. However our virtual ISM PCI device looks at new mappings on
> > IOTLB flush and immediately goes into the error state if such a mapping
> > violates its allowed DMA ranges. This then breaks pass-through of the
> > ISM device to a KVM guest.
> >
> > Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA
> > address 0 irrespective of the IOMMUs reserved regions. Even if usually
> > harmless this seems wrong in the general case so instead go through the
> > freshly updated IOVA list and try to find a range that isn't reserved
> > and fits 2 pages and use that for testing for fine grained super pages.
>
> Why does it matter? The s390 driver will not set fgsp=true, so if it
> fails because map fails or does a proper detection it shouldn't make a
> difference.
>
> IOW how does this actualy manifest into a failure?

Oh, yeah I agree, that's what I meant by saying that just mapping should
usually be harmless. This is indeed the case for all normal PCI devices
on s390; there it doesn't matter.

The problem manifests only with ISM devices which are a special s390
virtual PCI device that is implemented in the machine hypervisor. This
device is used for high speed cross-LPAR (Logical Partition)
communication, basically it allows two LPARs that previously exchanged
an authentication token to memcpy between their partitioned memory
using the virtual device. For copying, a receiving LPAR will IOMMU map a
region of memory for the ISM device that it will allow DMAing into
(memcpy by the hypervisor). All other regions remain unmapped and thus
inaccessible. In preparation, the device emulation in the machine
hypervisor intercepts the IOTLB flush and looks at the IOMMU
translation tables, performing e.g. size and alignment checks I presume;
one of these checks is against the start/end DMA boundaries. This check
fails, which leads to the virtual ISM device being put into an error
state. Being in an error state it then fails to be initialized by the
guest driver later on.

>
> > -	if (!ret) {
> > -		size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);
> > +	list_for_each_entry(region, regions, list) {
> > +		if (region->end - region->start < PAGE_SIZE * 2)
> > +			continue;
> >
> > -		if (unmapped == PAGE_SIZE)
> > -			iommu_unmap(domain->domain, PAGE_SIZE, PAGE_SIZE);
> > -		else
> > -			domain->fgsp = true;
> > +		ret = iommu_map(domain->domain, region->start, page_to_phys(pages), PAGE_SIZE * 2,
> > +				IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
>
> The region also needs to have 'region->start % (PAGE_SIZE*2) == 0' for the
> test to work
>
> Jason

Ah okay, makes sense. I guess that check could easily be added.
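The alignment requirement raised in the review can be sketched as a small user-space C helper. This is only an illustration under stated assumptions, not a proposed kernel change: `struct iova_range`, `align_up()`, `find_fgsp_test_iova()` and `SKETCH_PAGE_SIZE` are hypothetical names, the 4 KiB page size is assumed, and `end` is treated as an inclusive address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for an IOVA list entry; 'end' is inclusive. */
struct iova_range {
	uint64_t start;
	uint64_t end;
};

/* Round addr up to a multiple of align (a power of two); false on overflow. */
static bool align_up(uint64_t addr, uint64_t align, uint64_t *out)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	uint64_t mask = align - 1;

	if (addr > UINT64_MAX - mask)
		return false;
	*out = (addr + mask) & ~mask;
	return true;
}

/*
 * Pick the first range that can hold two pages starting at an address
 * aligned to two pages. On success store that address in *iova.
 */
static bool find_fgsp_test_iova(const struct iova_range *ranges, size_t n,
				uint64_t *iova)
{
	const uint64_t len = 2 * SKETCH_PAGE_SIZE;

	for (size_t i = 0; i < n; i++) {
		uint64_t start;

		if (!align_up(ranges[i].start, len, &start))
			continue;
		/* Need start + len - 1 <= end, written without overflow. */
		if (start > ranges[i].end || ranges[i].end - start < len - 1)
			continue;
		*iova = start;
		return true;
	}
	return false;
}
```

Aligning the start upwards, rather than skipping any range whose start is unaligned, lets a large but oddly placed range still be used for the test; whether that is preferable in the kernel is a design choice this sketch does not settle.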
On Wed, Jan 04, 2023 at 10:52:55AM +0100, Niklas Schnelle wrote:
> The problem manifests only with ISM devices which are a special s390
> virtual PCI device that is implemented in the machine hypervisor. This
> device is used for high speed cross-LPAR (Logical Partition)
> communication, basically it allows two LPARs that previously exchanged
> an authentication token to memcpy between their partitioned memory
> using the virtual device. For copying a receiving LPAR will IOMMU map a
> region of memory for the ISM device that it will allow DMAing into
> (memcpy by the hypervisor). All other regions remain unmapped and thus
> inaccessible. In preparation the device emulation in the machine
> hypervisor intercepts the IOTLB flush and looks at the IOMMU
> translation tables performing e.g. size and alignment checks I presume,
> one of these checks against the start/end DMA boundaries. This check
> fails which leads to the virtual ISM device being put into an error
> state. Being in an error state it then fails to be initialized by the
> guest driver later on.

You could rephrase this as saying that the S390 map operation doesn't
check for bounds so mapping in a reserved region doesn't fail, but
errors the HW.

Which seems reasonable to me

Jason
On Wed, 2023-01-04 at 08:16 -0400, Jason Gunthorpe wrote:
> On Wed, Jan 04, 2023 at 10:52:55AM +0100, Niklas Schnelle wrote:
>
> > The problem manifests only with ISM devices which are a special s390
> > virtual PCI device that is implemented in the machine hypervisor. This
> > device is used for high speed cross-LPAR (Logical Partition)
> > communication, basically it allows two LPARs that previously exchanged
> > an authentication token to memcpy between their partitioned memory
> > using the virtual device. For copying a receiving LPAR will IOMMU map a
> > region of memory for the ISM device that it will allow DMAing into
> > (memcpy by the hypervisor). All other regions remain unmapped and thus
> > inaccessible. In preparation the device emulation in the machine
> > hypervisor intercepts the IOTLB flush and looks at the IOMMU
> > translation tables performing e.g. size and alignment checks I presume,
> > one of these checks against the start/end DMA boundaries. This check
> > fails which leads to the virtual ISM device being put into an error
> > state. Being in an error state it then fails to be initialized by the
> > guest driver later on.
>
> You could rephrase this as saying that the S390 map operation doesn't
> check for bounds so mapping in a reserved region doesn't fail, but
> errors the HW.
>
> Which seems reasonable to me
>
> Jason

Kind of, yes. Before the recent IOMMU changes the IOMMU code did check
on map, failing early, but now it handles the limits via reserved
regions. The IOMMU hardware would only check the limits once an actual
DMA uses them, but of course no DMA will be triggered for this test
mapping. For this specific virtual device, though, there is an extra
check as part of an intercepted IOTLB flush (RPCIT instruction in S390).
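The kind of bounds check discussed here, a mapping validated against a device's start/end DMA boundaries, can be sketched in user-space C as follows. This is a guess at the shape of such a check, not the hypervisor's or the s390 driver's actual logic: `struct dma_window` and `mapping_within_window()` are hypothetical names, and both boundaries are assumed to be inclusive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a device's permitted DMA window (inclusive). */
struct dma_window {
	uint64_t start_dma;
	uint64_t end_dma;
};

/*
 * Return true if the mapping [iova, iova + len - 1] lies entirely inside
 * the window. A zero-length mapping is rejected.
 */
static bool mapping_within_window(const struct dma_window *w,
				  uint64_t iova, uint64_t len)
{
	if (len == 0)
		return false;
	if (iova < w->start_dma || iova > w->end_dma)
		return false;
	/* Compare lengths instead of end addresses to avoid overflow. */
	return len - 1 <= w->end_dma - iova;
}
```

With a window that starts at 2^32, as described in the commit message for current machines, a two-page mapping at address 0 would fail this check, while the same mapping placed at 2^32 would pass.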
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe98c00..9395097897b8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1856,24 +1856,31 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
  * significantly boosts non-hugetlbfs mappings and doesn't seem to hurt when
  * hugetlbfs is in use.
  */
-static void vfio_test_domain_fgsp(struct vfio_domain *domain)
+static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *regions)
 {
-	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
+	struct vfio_iova *region;
+	struct page *pages;
 
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
 
-	ret = iommu_map(domain->domain, 0, page_to_phys(pages), PAGE_SIZE * 2,
-			IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
-	if (!ret) {
-		size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);
+	list_for_each_entry(region, regions, list) {
+		if (region->end - region->start < PAGE_SIZE * 2)
+			continue;
 
-		if (unmapped == PAGE_SIZE)
-			iommu_unmap(domain->domain, PAGE_SIZE, PAGE_SIZE);
-		else
-			domain->fgsp = true;
+		ret = iommu_map(domain->domain, region->start, page_to_phys(pages), PAGE_SIZE * 2,
+				IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
+		if (!ret) {
+			size_t unmapped = iommu_unmap(domain->domain, region->start, PAGE_SIZE);
+
+			if (unmapped == PAGE_SIZE)
+				iommu_unmap(domain->domain, region->start + PAGE_SIZE, PAGE_SIZE);
+			else
+				domain->fgsp = true;
+		}
+		break;
 	}
 
 	__free_pages(pages, order);
@@ -2326,7 +2333,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		}
 	}
 
-	vfio_test_domain_fgsp(domain);
+	vfio_test_domain_fgsp(domain, &iova_copy);
 
 	/* replay mappings on new domains */
 	ret = vfio_iommu_replay(iommu, domain);
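The decision made by the test in this patch (map two pages with one call, unmap only the first, and inspect how much was actually unmapped) can be modelled in user-space C. This is a toy model under stated assumptions, not kernel code: `struct mock_iommu`, `mock_unmap_first_page()`, `detect_fgsp()` and `SKETCH_PAGE_SIZE` are hypothetical, and the mock only captures whether the IOMMU merged the two pages into one larger mapping.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/*
 * Hypothetical mock: models only how much gets torn down when one page
 * of a two-page mapping is unmapped.
 */
struct mock_iommu {
	bool merges_into_superpage;
};

/* Stand-in for unmapping the first page; returns the bytes unmapped. */
static size_t mock_unmap_first_page(const struct mock_iommu *m)
{
	return m->merges_into_superpage ? 2 * SKETCH_PAGE_SIZE
					: SKETCH_PAGE_SIZE;
}

/*
 * Mirrors the branch in the patch: if exactly one page was unmapped, the
 * second page is still mapped and fgsp stays false; any other result
 * sets fgsp.
 */
static bool detect_fgsp(const struct mock_iommu *m)
{
	size_t unmapped = mock_unmap_first_page(m);

	return unmapped != SKETCH_PAGE_SIZE;
}
```

In the real function the second page must additionally be unmapped when only one page was removed, which is why the patch issues a second `iommu_unmap()` at `region->start + PAGE_SIZE` in that branch.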