Message ID | cover.1679309810.git.petr.tesarik.ext@huawei.com
---|---
From | Petr Tesarik <petrtesarik@huaweicloud.com>
To | Jonathan Corbet <corbet@lwn.net>, Christoph Hellwig <hch@lst.de>, Marek Szyprowski <m.szyprowski@samsung.com>, Robin Murphy <robin.murphy@arm.com>, Borislav Petkov <bp@suse.de>, "Paul E. McKenney" <paulmck@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Randy Dunlap <rdunlap@infradead.org>, Damien Le Moal <damien.lemoal@opensource.wdc.com>, Kim Phillips <kim.phillips@amd.com>, "Steven Rostedt (Google)" <rostedt@goodmis.org>, linux-doc@vger.kernel.org (open list:DOCUMENTATION), linux-kernel@vger.kernel.org (open list), iommu@lists.linux.dev (open list:DMA MAPPING HELPERS)
Cc | Roberto Sassu <roberto.sassu@huawei.com>, petr@tesarici.cz
Subject | [RFC v1 0/4] Allow dynamic allocation of software IO TLB bounce buffers
Date | Mon, 20 Mar 2023 13:28:12 +0100
Series | Allow dynamic allocation of software IO TLB bounce buffers
Message
Petr Tesarik
March 20, 2023, 12:28 p.m. UTC
From: Petr Tesarik <petr.tesarik.ext@huawei.com>
The goal of my work is to provide more flexibility in the sizing of
SWIOTLB. This patch series is a request for comments from the wider
community. The code is more of a crude hack than a final solution.
I would appreciate suggestions for measuring the performance impact
of changes in SWIOTLB. More info at the end of this cover letter.
The software IO TLB was designed with these assumptions:
1. It would not be used much, especially on 64-bit systems.
2. A small fixed memory area (64 MiB by default) is sufficient to
handle the few cases which require a bounce buffer.
3. 64 MiB is little enough that it has no impact on the rest of the
system.
These assumptions do not always hold. First, if SEV is active, all DMA must be done through shared
unencrypted pages, and SWIOTLB is used to make this happen without
changing device drivers. The software IO TLB size is increased to
6% of total memory in sev_setup_arch(), but that is more of an
approximation. The actual requirements may vary depending on the
amount of I/O and which drivers are used. These factors may not be
known at boot time, i.e. when SWIOTLB is allocated.
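(For reference, a minimal sketch of the sizing logic just described, modeled on x86's sev_setup_arch() calling swiotlb_adjust_size(); the 6% figure is from the text above, while the clamp limits reflect my reading of mainline at the time and should be taken as an approximation rather than a quote.)

```c
/*
 * Sketch of the SEV-driven SWIOTLB sizing described above.  Modeled on
 * x86's sev_setup_arch() -> swiotlb_adjust_size(); the clamp limits are
 * an approximation of mainline, not an exact quote.
 */
#include <linux/init.h>
#include <linux/cc_platform.h>
#include <linux/memblock.h>
#include <linux/minmax.h>
#include <linux/sizes.h>
#include <linux/swiotlb.h>

static void __init sev_swiotlb_sizing_sketch(void)
{
	phys_addr_t total_mem = memblock_phys_mem_size();
	unsigned long size;

	if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		return;				/* not an SEV guest */

	size = total_mem * 6 / 100;		/* "6% of total memory" */
	size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G);
	swiotlb_adjust_size(size);		/* must run before swiotlb_init() */
}
```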
Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
moved from the rendering GPU (v3d driver), which can access all
memory, to the display output (vc4 driver), which is connected to a
bus with an address limit of 1 GiB and no IOMMU. These buffers can
be large (8 MiB with a FullHD monitor, 34 MiB with a 4K monitor)
and cannot even be handled by the current SWIOTLB, because they exceed
the maximum segment size of 256 KiB. Mapping failures can be
easily reproduced with GNOME remote desktop on a Raspberry Pi 4.
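(For context, the 256 KiB limit follows from the SWIOTLB slot constants in include/linux/swiotlb.h; demo_max_bounce() below is only an illustrative helper, not kernel API.)

```c
/*
 * Why a single SWIOTLB mapping currently tops out at 256 KiB: a bounce
 * buffer is carved from at most IO_TLB_SEGSIZE contiguous slots of
 * IO_TLB_SIZE bytes (2 KiB, i.e. 1 << IO_TLB_SHIFT) each.
 */
#include <linux/swiotlb.h>

static inline size_t demo_max_bounce(void)
{
	/* 128 slots * 2 KiB per slot = 256 KiB */
	return (size_t)IO_TLB_SEGSIZE << IO_TLB_SHIFT;
}
/*
 * An 8 MiB Full HD dma-buf exceeds this limit 32 times over, and a
 * 34 MiB 4K buffer 136 times, so a single contiguous mapping fails.
 */
```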
Third, other colleagues have noticed that they can reliably get rid
of occasional OOM kills on an Arm embedded device by reducing the
SWIOTLB size. This can be achieved with a kernel parameter, but
determining the right value puts additional burden on pre-release
testing, which could be avoided if SWIOTLB is allocated small and
grows only when necessary.
I have tried to measure the expected performance degradation so
that I could reduce it and/or compare it to alternative approaches.
I have performed all tests on an otherwise idle Raspberry Pi 4 with
swiotlb=force (which, admittedly, is a bit artificial). I quickly
ran into trouble.
I ran fio against an ext3 filesystem mounted from a UAS drive. To
my surprise, forcing swiotlb (without my patches) *improved* IOPS
and bandwidth for 4K and 64K blocks by 3 to 7 percent, and made no
visible difference for 1M blocks. I also observed smaller minimum
and average completion latencies, and even smaller maximum
latencies for 4K blocks. However, when I ran the tests again later
to verify some oddities, there was a performance drop. It appears
that IOPS, bandwidth and latencies reported by two consecutive fio
runs may differ by as much as 10%, so the results are invalid.
I tried to micro-benchmark dma_map_page_attrs() using the
bcc tool funclatency, but just loading the eBPF program was enough
to change the behaviour of the system wildly.
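(In case it helps anyone reproduce the attempt, below is roughly the kind of in-kernel measurement one could fall back to instead of funclatency. This is only a sketch, not code from this series; "dev" must be a real, DMA-capable struct device supplied by the surrounding driver context.)

```c
/*
 * Rough in-kernel alternative to funclatency: time dma_map_page_attrs()
 * directly with ktime, avoiding the eBPF attach overhead.  Sketch only.
 */
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/mm.h>

static u64 time_one_mapping(struct device *dev, size_t size)
{
	struct page *page = alloc_pages(GFP_KERNEL, get_order(size));
	dma_addr_t addr;
	u64 t0, t1;

	if (!page)
		return 0;

	t0 = ktime_get_ns();
	addr = dma_map_page_attrs(dev, page, 0, size, DMA_TO_DEVICE, 0);
	t1 = ktime_get_ns();

	if (!dma_mapping_error(dev, addr))
		dma_unmap_page_attrs(dev, addr, size, DMA_TO_DEVICE, 0);
	__free_pages(page, get_order(size));

	return t1 - t0;		/* nanoseconds spent in the map path */
}
```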
I wonder if anyone can give me advice on measuring SWIOTLB
performance. I can see that AMD, IBM and Microsoft people have
mentioned performance in their patches, but AFAICS without
explaining how it was measured. Knowing a bit more would be much
appreciated.
Petr Tesarik (4):
dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
swiotlb: Move code around in preparation for dynamic bounce buffers
swiotlb: Allow dynamic allocation of bounce buffers
swiotlb: Add an option to allow dynamic bounce buffers
.../admin-guide/kernel-parameters.txt | 6 +-
Documentation/core-api/dma-attributes.rst | 10 +
include/linux/dma-mapping.h | 6 +
include/linux/swiotlb.h | 17 +-
kernel/dma/swiotlb.c | 233 +++++++++++++++---
5 files changed, 241 insertions(+), 31 deletions(-)
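(For illustration, this is roughly how a driver could opt in to the attribute introduced in patch 1. The DMA_ATTR_MAY_SLEEP flag comes from this series and does not exist without it applied; the call site is a made-up example, not code from any real driver.)

```c
/*
 * Hypothetical call site for the DMA_ATTR_MAY_SLEEP attribute from
 * patch 1 of this series: a driver that maps buffers from process
 * context lets the dynamic bounce-buffer allocation block if needed.
 */
#include <linux/dma-mapping.h>

static dma_addr_t example_map(struct device *dev, struct page *page,
			      size_t len)
{
	/* DMA_ATTR_MAY_SLEEP exists only with this series applied. */
	return dma_map_page_attrs(dev, page, 0, len, DMA_TO_DEVICE,
				  DMA_ATTR_MAY_SLEEP);
}
```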
Comments
Hello after a week,

I'm top-posting, because I'm not replying to anything particular in my original RFC. It seems there are no specific comments from other people either. My understanding is that the general idea of making SWIOTLB grow and shrink as needed is welcome, but nobody can suggest a good method to measure the expected performance impact. FWIW I have seen zero impact if the patches are applied but not enabled through the "swiotlb=dynamic" boot parameter. I believe this is sufficient to start developing dynamic SWIOTLB as an experimental feature.

@Christoph: I was kind of expecting you to jump up and tell me that the non-coherent DMA API cannot be used to allocate bounce buffers. ;-) But since that has not happened, I will assume that there is no weird architecture which treats SWIOTLB differently from other DMA-capable regions.

Anyway, I have some comments of my own on the v1 RFC series.

First, the maple tree cannot be used to track buffers, because it is not safe to use from interrupt context. The maple tree implementation currently allocates new nodes with kmem_cache_alloc_bulk(), which must be called with interrupts enabled. This can be changed in the maple tree code, but I could also pre-allocate nodes from non-irq context, or use another data structure instead.

Second, based on discussions with my dear colleague Roberto, I have considered alternative approaches:

A. Keep track of SWIOTLB utilization and allocate additional io_tlb_mem structures from a separate kernel thread when necessary. The advantage is that allocations are moved away from the hot path (map/unmap). The disadvantage is that these additional tables would necessarily be small (limited by MAX_ORDER and physical memory fragmentation). Another disadvantage is that such a multi-table SWIOTLB is not very likely to shrink, because a table can be freed only if ALL slots are free. A structural sketch of this approach follows below.

B. Allocate a very big SWIOTLB, but allow it to be used for normal allocations (similar to the CMA approach). The advantage is that there is only one table, pushing performance impact down to almost zero. The main challenge is migrating pages to/from the SWIOTLB. Existing CMA code cannot be reused, because CMA cannot be used from atomic contexts, unlike SWIOTLB.

For the time being, I propose to start with my RFC series, accepting some performance drop if the feature is explicitly enabled. With more feedback from the field (especially from you, SEV SNP guys), I can work on improving performance.

Petr T
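(Structural sketch of alternative A above: a watcher thread that adds a new pool when utilization crosses a threshold. Every name here, swiotlb_pool, swiotlb_grow_thread and so on, is invented for illustration; the real io_tlb_mem layout and the locking between the watcher and the map/unmap hot path are deliberately not shown.)

```c
/* Sketch of alternative A: grow SWIOTLB from a separate kernel thread. */
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/slab.h>

struct swiotlb_pool {
	struct list_head node;
	unsigned long nslabs;
	unsigned long used;	/* updated by the map/unmap hot path */
};

static LIST_HEAD(swiotlb_pools);

static int swiotlb_grow_thread(void *unused)
{
	while (!kthread_should_stop()) {
		struct swiotlb_pool *pool;
		bool pressure = false;

		list_for_each_entry(pool, &swiotlb_pools, node)
			if (pool->used * 10 > pool->nslabs * 9)
				pressure = true;	/* pool > 90% full */

		if (pressure) {
			pool = kzalloc(sizeof(*pool), GFP_KERNEL);
			if (pool) {
				/* new pools stay small: limited by MAX_ORDER */
				pool->nslabs = 1024;
				list_add_tail(&pool->node, &swiotlb_pools);
			}
		}
		msleep(100);
	}
	return 0;
}
```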
On Mon, Mar 27, 2023 at 01:06:34PM +0200, Petr Tesarik wrote:
> B. Allocate a very big SWIOTLB, but allow it to be used for normal
> allocations (similar to the CMA approach). The advantage is that there
> is only one table, pushing performance impact down to almost zero. The
> main challenge is migrating pages to/from the SWIOTLB. Existing CMA code
> cannot be reused, because CMA cannot be used from atomic contexts,
> unlike SWIOTLB.

That actually sounds very interesting, although I'd go further and figure out if we:

 a) could get away with only allowing the CMA allocation for sleeping contexts, if we have enough sleeping contexts to matter
 b) check with the CMA maintainers if it is feasible and acceptable to them to extend CMA for irq allocations.

That being said, I think cases like dma-buf sharing really need to be addressed at a higher level instead of basically double allocating these long-term memory allocations.

I'd also really love to hear some feedback from the various confidential computing implementors, as that seems to be the big driving force for swiotlb currently.
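(A minimal sketch of what point (a) could look like, assuming a hypothetical swiotlb_dyn_alloc() helper in the dynamic allocation path. gfpflags_allow_blocking() and dma_alloc_from_contiguous() are existing kernel APIs; everything else here is invented for illustration.)

```c
/*
 * Sketch of point (a): take the CMA-backed path only when the caller may
 * sleep, otherwise let the caller fall back to the static pool.
 * swiotlb_dyn_alloc() is a hypothetical helper, not part of this series
 * or of mainline.
 */
#include <linux/dma-map-ops.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *swiotlb_dyn_alloc(struct device *dev, size_t size,
				      gfp_t gfp)
{
	if (gfpflags_allow_blocking(gfp))
		return dma_alloc_from_contiguous(dev,
						 PAGE_ALIGN(size) >> PAGE_SHIFT,
						 get_order(size), false);

	return NULL;	/* atomic context: caller uses the static pool */
}
```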