Message ID: 20240228133930.15400-7-will@kernel.org
State: New
Series: Fix double allocation in swiotlb_alloc()

From: Will Deacon <will@kernel.org>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@android.com, Will Deacon <will@kernel.org>, iommu@lists.linux.dev, Christoph Hellwig <hch@lst.de>, Marek Szyprowski <m.szyprowski@samsung.com>, Robin Murphy <robin.murphy@arm.com>, Petr Tesarik <petr.tesarik1@huawei-partners.com>, Dexuan Cui <decui@microsoft.com>, Nicolin Chen <nicolinc@nvidia.com>, Michael Kelley <mhklinux@outlook.com>
Subject: [PATCH v5 6/6] swiotlb: Remove pointless stride adjustment for allocations >= PAGE_SIZE
Date: Wed, 28 Feb 2024 13:39:30 +0000
In-Reply-To: <20240228133930.15400-1-will@kernel.org>
References: <20240228133930.15400-1-will@kernel.org>
Commit Message
Will Deacon
Feb. 28, 2024, 1:39 p.m. UTC
For swiotlb allocations >= PAGE_SIZE, the slab search historically
adjusted the stride to avoid checking unaligned slots. However, this is
no longer needed now that the code around it has evolved and the
stride is calculated from the required alignment.
Either 'alloc_align_mask' is used to specify the allocation alignment or
the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
At least one of these masks is always non-zero.
In light of that, remove the redundant (and slightly confusing) check.
Link: https://lore.kernel.org/r/SN6PR02MB4157089980E6FC58D5557BCED4572@SN6PR02MB4157.namprd02.prod.outlook.com
Reported-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
kernel/dma/swiotlb.c | 7 -------
1 file changed, 7 deletions(-)
Comments
From: Will Deacon <will@kernel.org> Sent: Wednesday, February 28, 2024 5:40 AM
>
> For swiotlb allocations >= PAGE_SIZE, the slab search historically
> adjusted the stride to avoid checking unaligned slots. However, this is
> no longer needed now that the code around it has evolved and the
> stride is calculated from the required alignment.
>
> Either 'alloc_align_mask' is used to specify the allocation alignment or
> the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
> At least one of these masks is always non-zero.

I think the patch is correct, but this justification is not.
alloc_align_mask and the DMA min_align_mask are often both zero. While
the NVMe PCI driver sets min_align_mask, SCSI disk drivers do not
(except for the Hyper-V synthetic SCSI driver). When both are zero,
presumably there are no alignment requirements, so a stride of 1 is
appropriate.

Michael

> In light of that, remove the redundant (and slightly confusing) check.
>
> Link: https://lore.kernel.org/all/SN6PR02MB4157089980E6FC58D5557BCED4572@SN6PR02MB4157.namprd02.prod.outlook.com/
> Reported-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  kernel/dma/swiotlb.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c381a7ed718f..0d8805569f5e 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1006,13 +1006,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
>  	 */
>  	stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
>
> -	/*
> -	 * For allocations of PAGE_SIZE or larger only look for page aligned
> -	 * allocations.
> -	 */
> -	if (alloc_size >= PAGE_SIZE)
> -		stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> -
>  	spin_lock_irqsave(&area->lock, flags);
>  	if (unlikely(nslots > pool->area_nslabs - area->used))
>  		goto not_found;
> --
> 2.44.0.rc1.240.g4c46232300-goog
> From: Will Deacon <will@kernel.org> Sent: Wednesday, February 28, 2024 5:40 AM
> >
> > For swiotlb allocations >= PAGE_SIZE, the slab search historically
> > adjusted the stride to avoid checking unaligned slots. However, this is
> > no longer needed now that the code around it has evolved and the
> > stride is calculated from the required alignment.
> >
> > Either 'alloc_align_mask' is used to specify the allocation alignment or
> > the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'.
> > At least one of these masks is always non-zero.
>
> I think the patch is correct, but this justification is not. alloc_align_mask
> and the DMA min_align_mask are often both zero. While the NVMe
> PCI driver sets min_align_mask, SCSI disk drivers do not (except for the
> Hyper-V synthetic SCSI driver). When both are zero, presumably
> there are no alignment requirements, so a stride of 1 is appropriate.

I did additional checking into the history of page alignment for alloc
sizes of a page or greater. The swiotlb code probably made sense prior
to commit 0eee5ae10256, which contains this change:

 	slot_base = area_index * mem->area_nslabs;
-	index = wrap_area_index(mem, ALIGN(area->index, stride));
+	index = area->index;

 	for (slots_checked = 0; slots_checked < mem->area_nslabs; ) {
 		slot_index = slot_base + index;

In the old code, once PAGE_SIZE was factored into the stride, the
stride was used to align the starting index, so that the first slot
checked would be page aligned. But commit 0eee5ae10256 drops that use
of the stride and puts the page alignment in iotlb_align_mask, for
reasons explained in the commit message. In Will's Patch 1, the page
alignment is no longer incorporated into iotlb_align_mask (for good
reason), so page-aligning the stride no longer does anything useful,
resulting in this patch.

If there *is* a requirement for page alignment of page-size-or-greater
requests, even when alloc_align_mask and min_align_mask are zero, we
need to think about how to do that correctly, as that requirement is no
longer met after Patch 1 of this series.

Michael
On Thu, Feb 29, 2024 at 07:36:05AM +0000, Michael Kelley wrote:
> If there *is* a requirement for page alignment of page-size-or-greater
> requests, even when alloc_align_mask and min_align_mask are zero,
> we need to think about how to do that correctly, as that requirement
> is no longer met after Patch 1 of this series.

It has been historical behavior that all DMA allocations are page
aligned (including, in fact, smaller-than-page-sized ones that get
rounded up to a page size). The documentation actually (incorrectly)
states an even stronger guarantee:

  "The CPU virtual address and the DMA address are both guaranteed to
  be aligned to the smallest PAGE_SIZE order which is greater than or
  equal to the requested size. This invariant exists (for example) to
  guarantee that if you allocate a chunk which is smaller than or
  equal to 64 kilobytes, the extent of the buffer you receive will not
  cross a 64K boundary."
From: Christoph Hellwig <hch@lst.de> Sent: Thursday, February 29, 2024 5:34 AM
>
> On Thu, Feb 29, 2024 at 07:36:05AM +0000, Michael Kelley wrote:
> > If there *is* a requirement for page alignment of page-size-or-greater
> > requests, even when alloc_align_mask and min_align_mask are zero,
> > we need to think about how to do that correctly, as that requirement
> > is no longer met after Patch 1 of this series.
>
> It has been historical behavior that all dma allocations are page
> aligned (including in fact smaller than page sized ones that get
> rounded up to a page size). The documentation actually (incorrectly)
> states an even stronger guarantee:
>
> "The CPU virtual address and the DMA address are both
> guaranteed to be aligned to the smallest PAGE_SIZE order which
> is greater than or equal to the requested size. This invariant
> exists (for example) to guarantee that if you allocate a chunk
> which is smaller than or equal to 64 kilobytes, the extent of the
> buffer you receive will not cross a 64K boundary."

Any thoughts on how that historical behavior should apply if the DMA
min_align_mask is non-zero, or the alloc_align_mask parameter to
swiotlb_tbl_map_single() is non-zero? As currently used,
alloc_align_mask is page aligned if the IOMMU granule is >= PAGE_SIZE.
But a non-zero min_align_mask could mandate returning a buffer that is
not page aligned. Perhaps do the historical behavior only if
alloc_align_mask and min_align_mask are both zero?

Michael
On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> Any thoughts on how that historical behavior should apply if
> the DMA min_align_mask is non-zero, or the alloc_align_mask
> parameter to swiotlb_tbl_map_single() is non-zero? As currently
> used, alloc_align_mask is page aligned if the IOMMU granule is
> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> returning a buffer that is not page aligned. Perhaps do the
> historical behavior only if alloc_align_mask and min_align_mask
> are both zero?

I think the driver setting min_align_mask is a clear indicator that the
driver requested a specific alignment and the defaults don't apply. For
swiotlb_tbl_map_single() as used by dma-iommu I'd have to take a closer
look at how it is used.
On Thu, 29 Feb 2024 16:47:56 +0100
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > Any thoughts on how that historical behavior should apply if
> > the DMA min_align_mask is non-zero, or the alloc_align_mask
> > parameter to swiotlb_tbl_map_single() is non-zero? As currently
> > used, alloc_align_mask is page aligned if the IOMMU granule is
> > >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > returning a buffer that is not page aligned. Perhaps do the
> > historical behavior only if alloc_align_mask and min_align_mask
> > are both zero?
>
> I think the driver setting min_align_mask is a clear indicator
> that the driver requested a specific alignment and the defaults
> don't apply. For swiotlb_tbl_map_single as used by dma-iommu
> I'd have to take a closer look at how it is used.

I'm not sure it helps in this discussion, but let me dive into a bit of
ancient history to understand how we ended up here.

IIRC this behaviour was originally motivated by limitations of PC AT
hardware. The Intel 8237 is a 16-bit DMA controller. To make it somehow
usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
added a page register, but it was on a separate chip and it did not
increment when the 8237 address rolled over back to zero. Effectively,
the page register selected a 64K-aligned window of 64K buffers.
Consequently, DMA buffers could not cross a 64K physical boundary.

Thanks to how the buddy allocator works, the 64K-boundary constraint
was satisfied by allocation size, and drivers took advantage of it when
allocating device buffers. IMO software bounce buffers simply followed
the same logic that worked for buffers allocated by the buddy allocator.

OTOH min_align_mask was motivated by NVMe, which prescribes the value
of a certain number of low bits in the DMA address (for simplicity
assumed to be identical with the same bits in the physical address).

The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
IIUC it is used to guarantee that unaligned transactions do not share
the IOMMU granule with another device. This whole thing is weird,
because swiotlb_tbl_map_single() is called like this:

	aligned_size = iova_align(iovad, size);
	phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
				      iova_mask(iovad), dir, attrs);

Here:

  * alloc_size = iova_align(iovad, size)
  * alloc_align_mask = iova_mask(iovad)

Now, iova_align() rounds up its argument to a multiple of the iova
granule and iova_mask() is simply "granule - 1". This works, because
granule size must be a power of 2, and I assume it must also be
>= PAGE_SIZE.

In that case, the alloc_align_mask argument is not even needed if you
adjust the code to match the documentation---the resulting buffer will
be aligned to a granule boundary by virtue of having a size that is a
multiple of the granule size.

To sum it up:

1. min_align_mask is by far the most important constraint. Devices will
   simply stop working if it is not met.
2. Alignment to the smallest PAGE_SIZE order which is greater than or
   equal to the requested size has been documented, and some drivers
   may rely on it.
3. alloc_align_mask is a misguided fix for a bug in the above.

Correct me if anything of the above is wrong.

HTH
Petr T
On 2024-03-01 3:39 pm, Petr Tesařík wrote:
> On Thu, 29 Feb 2024 16:47:56 +0100
> Christoph Hellwig <hch@lst.de> wrote:
>
>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
>>> Any thoughts on how that historical behavior should apply if
>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
>>> parameter to swiotlb_tbl_map_single() is non-zero? As currently
>>> used, alloc_align_mask is page aligned if the IOMMU granule is
>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
>>> returning a buffer that is not page aligned. Perhaps do the
>>> historical behavior only if alloc_align_mask and min_align_mask
>>> are both zero?
>>
>> I think the driver setting min_align_mask is a clear indicator
>> that the driver requested a specific alignment and the defaults
>> don't apply. For swiotlb_tbl_map_single as used by dma-iommu
>> I'd have to take a closer look at how it is used.
>
> I'm not sure it helps in this discussion, but let me dive into a bit
> of ancient history to understand how we ended up here.
>
> IIRC this behaviour was originally motivated by limitations of PC AT
> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> added a page register, but it was on a separate chip and it did not
> increment when the 8237 address rolled over back to zero. Effectively,
> the page register selected a 64K-aligned window of 64K buffers.
> Consequently, DMA buffers could not cross a 64K physical boundary.
>
> Thanks to how the buddy allocator works, the 64K-boundary constraint
> was satisfied by allocation size, and drivers took advantage of it when
> allocating device buffers. IMO software bounce buffers simply followed
> the same logic that worked for buffers allocated by the buddy allocator.
>
> OTOH min_align_mask was motivated by NVME which prescribes the value of
> a certain number of low bits in the DMA address (for simplicity assumed
> to be identical with the same bits in the physical address).
>
> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
> IIUC it is used to guarantee that unaligned transactions do not share
> the IOMMU granule with another device. This whole thing is weird,
> because swiotlb_tbl_map_single() is called like this:
>
> 	aligned_size = iova_align(iovad, size);
> 	phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> 				      iova_mask(iovad), dir, attrs);
>
> Here:
>
> * alloc_size = iova_align(iovad, size)
> * alloc_align_mask = iova_mask(iovad)
>
> Now, iova_align() rounds up its argument to a multiple of iova granule
> and iova_mask() is simply "granule - 1". This works, because granule
> size must be a power of 2, and I assume it must also be >= PAGE_SIZE.

Not quite, the granule must *not* be larger than PAGE_SIZE (but can be
smaller).

> In that case, the alloc_align_mask argument is not even needed if you
> adjust the code to match documentation---the resulting buffer will be
> aligned to a granule boundary by virtue of having a size that is a
> multiple of the granule size.

I think we've conflated two different notions of "allocation" here. The
page-alignment which Christoph quoted applies for dma_alloc_coherent(),
which nowadays *should* only be relevant to SWIOTLB code in the
swiotlb_alloc() path for stealing coherent pages out of restricted DMA
pools. Historically there was never any official alignment requirement
for the DMA addresses returned by dma_map_{page,sg}(), until
min_align_mask was introduced.

> To sum it up:
>
> 1. min_align_mask is by far the most important constraint. Devices will
>    simply stop working if it is not met.
> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
>    equal to the requested size has been documented, and some drivers
>    may rely on it.

Strictly it stopped being necessary since fafadcd16595 when the
historical swiotlb_alloc() was removed, but 602d9858f07c implies that
indeed the assumption of it for streaming mappings had already set in.
Of course NVMe is now using min_align_mask, but I'm not sure if it's
worth taking the risk to find out who else should also be.

> 3. alloc_align_mask is a misguided fix for a bug in the above.

No, alloc_align_mask is about describing the explicit requirement rather
than relying on any implicit behaviour, and thus being able to do the
optimal thing for, say, a 9KB mapping given a 4KB IOVA granule and 64KB
PAGE_SIZE.

Thanks,
Robin.

> Correct me if anything of the above is wrong.
>
> HTH
> Petr T
On Fri, 1 Mar 2024 16:39:27 +0100
Petr Tesařík <petr@tesarici.cz> wrote:

> [...]
>
> To sum it up:
>
> 1. min_align_mask is by far the most important constraint. Devices will
>    simply stop working if it is not met.
> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
>    equal to the requested size has been documented, and some drivers
>    may rely on it.
> 3. alloc_align_mask is a misguided fix for a bug in the above.
>
> Correct me if anything of the above is wrong.

I thought about it some more, and I believe I know what should happen
if the first two constraints appear to be mutually exclusive.

First, the alignment based on size does not guarantee that the
resulting physical address is aligned. In fact, the lowest
IO_TLB_SHIFT bits must always be identical to the original buffer
address.

Let's take an example request like this:

	TLB_SIZE       = 0x00000800
	min_align_mask = 0x0000ffff
	orig_addr      = 0x....1234
	alloc_size     = 0x00002800

The minimum alignment mask requires keeping the 0x1234 at the end.
The allocation size requires a buffer that is aligned to 16K. Of
course, there is no 16K-aligned slot with slot_address & 0x7ff ==
0x200, but if IO_TLB_SHIFT were 14, it would be slot_address & 0x3fff
== 0 (the low IO_TLB_SHIFT bits are masked off). Since the SWIOTLB API
does not guarantee any specific value of IO_TLB_SHIFT, callers cannot
rely on it. That means 0x1234 is a perfectly valid bounce buffer
address for this example.

The caller may rightfully expect that the 16K granule containing the
bounce buffer is not shared with any other user. For the above case I
suggest increasing the allocation size to 0x4000 already in
swiotlb_tbl_map_single() and treating 0x1234 as the offset from the
slot address.

Will, do you want to incorporate it into your patch series, or should
I send the corresponding patch?

Petr T
On Fri, 1 Mar 2024 16:38:33 +0000
Robin Murphy <robin.murphy@arm.com> wrote:

> On 2024-03-01 3:39 pm, Petr Tesařík wrote:
> [...]
> > Now, iova_align() rounds up its argument to a multiple of iova granule
> > and iova_mask() is simply "granule - 1". This works, because granule
> > size must be a power of 2, and I assume it must also be >= PAGE_SIZE.
>
> Not quite, the granule must *not* be larger than PAGE_SIZE (but can be
> smaller).

OK... I must rethink what is achieved with the alignment then.

> > In that case, the alloc_align_mask argument is not even needed if you
> > adjust the code to match documentation---the resulting buffer will be
> > aligned to a granule boundary by virtue of having a size that is a
> > multiple of the granule size.
>
> I think we've conflated two different notions of "allocation" here. The
> page-alignment which Christoph quoted applies for dma_alloc_coherent(),
> which nowadays *should* only be relevant to SWIOTLB code in the
> swiotlb_alloc() path for stealing coherent pages out of restricted DMA
> pools. Historically there was never any official alignment requirement
> for the DMA addresses returned by dma_map_{page,sg}(), until
> min_align_mask was introduced.
>
> > To sum it up:
> >
> > 1. min_align_mask is by far the most important constraint. Devices will
> >    simply stop working if it is not met.
> > 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> >    equal to the requested size has been documented, and some drivers
> >    may rely on it.
>
> Strictly it stopped being necessary since fafadcd16595 when the
> historical swiotlb_alloc() was removed, but 602d9858f07c implies that
> indeed the assumption of it for streaming mappings had already set in.
> Of course NVMe is now using min_align_mask, but I'm not sure if it's
> worth taking the risk to find out who else should also be.
>
> > 3. alloc_align_mask is a misguided fix for a bug in the above.
>
> No, alloc_align_mask is about describing the explicit requirement rather
> than relying on any implicit behaviour, and thus being able to do the
> optimal thing for, say, a 9KB mapping given a 4KB IOVA granule and 64KB
> PAGE_SIZE.

It's getting confusing. Can we stay with the IOMMU use case for a
moment?

Since you don't use IO_TLB_SIZE or IO_TLB_SHIFT, I assume the code
should work with any SWIOTLB slot size. For granule sizes up to the
SWIOTLB slot size, you always get a buffer that is not shared with any
other granule. For granule sizes between the SWIOTLB slot size and the
page size, you want to get as many slots as needed to cover the whole
granule. Indeed, that would not be achieved with size alone.

What if we change the size-based alignment constraint to:

  Aligned to the smallest IO_TLB_SIZE order which is greater than or
  equal to the requested size.

AFAICS this is stricter, i.e. it covers the already documented
PAGE_SIZE alignment, but it would also cover IOMMU needs.

I have the feeling that the slot search has too many parameters. This
alloc_align_mask has only one in-tree user, so it's a logical candidate
for reduction.

Petr T
On 2024-03-01 5:08 pm, Petr Tesařík wrote: > On Fri, 1 Mar 2024 16:39:27 +0100 > Petr Tesařík <petr@tesarici.cz> wrote: > >> On Thu, 29 Feb 2024 16:47:56 +0100 >> Christoph Hellwig <hch@lst.de> wrote: >> >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote: >>>> Any thoughts on how that historical behavior should apply if >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask >>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently >>>> used, alloc_align_mask is page aligned if the IOMMU granule is >>>>> = PAGE_SIZE. But a non-zero min_align_mask could mandate >>>> returning a buffer that is not page aligned. Perhaps do the >>>> historical behavior only if alloc_align_mask and min_align_mask >>>> are both zero? >>> >>> I think the driver setting min_align_mask is a clear indicator >>> that the driver requested a specific alignment and the defaults >>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu >>> I'd have to tak a closer look at how it is used. >> >> I'm not sure it helps in this discussion, but let me dive into a bit >> of ancient history to understand how we ended up here. >> >> IIRC this behaviour was originally motivated by limitations of PC AT >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM >> added a page register, but it was on a separate chip and it did not >> increment when the 8237 address rolled over back to zero. Effectively, >> the page register selected a 64K-aligned window of 64K buffers. >> Consequently, DMA buffers could not cross a 64K physical boundary. >> >> Thanks to how the buddy allocator works, the 64K-boundary constraint >> was satisfied by allocation size, and drivers took advantage of it when >> allocating device buffers. IMO software bounce buffers simply followed >> the same logic that worked for buffers allocated by the buddy allocator. 
>> >> OTOH min_align_mask was motivated by NVME which prescribes the value of >> a certain number of low bits in the DMA address (for simplicity assumed >> to be identical with the same bits in the physical address). >> >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and >> IIUC it is used to guarantee that unaligned transactions do not share >> the IOMMU granule with another device. This whole thing is weird, >> because swiotlb_tbl_map_single() is called like this: >> >> aligned_size = iova_align(iovad, size); >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size, >> iova_mask(iovad), dir, attrs); >> >> Here: >> >> * alloc_size = iova_align(iovad, size) >> * alloc_align_mask = iova_mask(iovad) >> >> Now, iova_align() rounds up its argument to a multiple of iova granule >> and iova_mask() is simply "granule - 1". This works, because granule >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE. >> >> In that case, the alloc_align_mask argument is not even needed if you >> adjust the code to match documentation---the resulting buffer will be >> aligned to a granule boundary by virtue of having a size that is a >> multiple of the granule size. >> >> To sum it up: >> >> 1. min_align_mask is by far the most important constraint. Devices will >> simply stop working if it is not met. >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or >> equal to the requested size has been documented, and some drivers >> may rely on it. >> 3. alloc_align_mask is a misguided fix for a bug in the above. >> >> Correct me if anything of the above is wrong. > > I thought about it some more, and I believe I know what should happen > if the first two constraints appear to be mutually exclusive. > > First, the alignment based on size does not guarantee that the resulting > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must > be always identical to the original buffer address. 
> > Let's take an example request like this: > > TLB_SIZE = 0x00000800 > min_align_mask = 0x0000ffff > orig_addr = 0x....1234 > alloc_size = 0x00002800 > > Minimum alignment mask requires to keep the 1234 at the end. Allocation > size requires a buffer that is aligned to 16K. Of course, there is no > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are > masked off). Since the SWIOTLB API does not guarantee any specific > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a > perfectly valid bounce buffer address for this example. > > The caller may rightfully expect that the 16K granule containing the > bounce buffer is not shared with any other user. For the above case I > suggest to increase the allocation size to 0x4000 already in > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot > address. That doesn't make sense - a caller asks to map some range of kernel addresses and they get back a corresponding range of DMA addresses; they cannot make any reasonable assumptions about DMA addresses *outside* that range. In the example above, the caller has explicitly chosen not to map the range xxx0000-xxx1234; if they expect the device to actually access bytes in the DMA range yyy0000-yyy1234, then they should have mapped the whole range starting from xxx0000 and it is their own error. SWIOTLB does not and cannot provide any memory protection itself, so there is no functional benefit to automatically over-allocating, all it will do is waste slots. iommu-dma *can* provide memory protection between individual mappings via additional layers that SWIOTLB doesn't know about, so in that case it's iommu-dma's responsibility to explicitly manage whatever over-allocation is necessary at the SWIOTLB level to match the IOMMU level. Thanks, Robin.
On Fri, 1 Mar 2024 17:54:06 +0000 Robin Murphy <robin.murphy@arm.com> wrote: > On 2024-03-01 5:08 pm, Petr Tesařík wrote: > > On Fri, 1 Mar 2024 16:39:27 +0100 > > Petr Tesařík <petr@tesarici.cz> wrote: > > > >> On Thu, 29 Feb 2024 16:47:56 +0100 > >> Christoph Hellwig <hch@lst.de> wrote: > >> > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote: > >>>> Any thoughts on how that historical behavior should apply if > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask > >>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is > >>>>> = PAGE_SIZE. But a non-zero min_align_mask could mandate > >>>> returning a buffer that is not page aligned. Perhaps do the > >>>> historical behavior only if alloc_align_mask and min_align_mask > >>>> are both zero? > >>> > >>> I think the driver setting min_align_mask is a clear indicator > >>> that the driver requested a specific alignment and the defaults > >>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu > >>> I'd have to tak a closer look at how it is used. > >> > >> I'm not sure it helps in this discussion, but let me dive into a bit > >> of ancient history to understand how we ended up here. > >> > >> IIRC this behaviour was originally motivated by limitations of PC AT > >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM > >> added a page register, but it was on a separate chip and it did not > >> increment when the 8237 address rolled over back to zero. Effectively, > >> the page register selected a 64K-aligned window of 64K buffers. > >> Consequently, DMA buffers could not cross a 64K physical boundary. > >> > >> Thanks to how the buddy allocator works, the 64K-boundary constraint > >> was satisfied by allocation size, and drivers took advantage of it when > >> allocating device buffers. 
IMO software bounce buffers simply followed > >> the same logic that worked for buffers allocated by the buddy allocator. > >> > >> OTOH min_align_mask was motivated by NVME which prescribes the value of > >> a certain number of low bits in the DMA address (for simplicity assumed > >> to be identical with the same bits in the physical address). > >> > >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and > >> IIUC it is used to guarantee that unaligned transactions do not share > >> the IOMMU granule with another device. This whole thing is weird, > >> because swiotlb_tbl_map_single() is called like this: > >> > >> aligned_size = iova_align(iovad, size); > >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size, > >> iova_mask(iovad), dir, attrs); > >> > >> Here: > >> > >> * alloc_size = iova_align(iovad, size) > >> * alloc_align_mask = iova_mask(iovad) > >> > >> Now, iova_align() rounds up its argument to a multiple of iova granule > >> and iova_mask() is simply "granule - 1". This works, because granule > >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE. > >> > >> In that case, the alloc_align_mask argument is not even needed if you > >> adjust the code to match documentation---the resulting buffer will be > >> aligned to a granule boundary by virtue of having a size that is a > >> multiple of the granule size. > >> > >> To sum it up: > >> > >> 1. min_align_mask is by far the most important constraint. Devices will > >> simply stop working if it is not met. > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or > >> equal to the requested size has been documented, and some drivers > >> may rely on it. > >> 3. alloc_align_mask is a misguided fix for a bug in the above. > >> > >> Correct me if anything of the above is wrong. > > > > I thought about it some more, and I believe I know what should happen > > if the first two constraints appear to be mutually exclusive. 
> > > > First, the alignment based on size does not guarantee that the resulting > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must > > be always identical to the original buffer address. > > > > Let's take an example request like this: > > > > TLB_SIZE = 0x00000800 > > min_align_mask = 0x0000ffff > > orig_addr = 0x....1234 > > alloc_size = 0x00002800 > > > > Minimum alignment mask requires to keep the 1234 at the end. Allocation > > size requires a buffer that is aligned to 16K. Of course, there is no > > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT > > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are > > masked off). Since the SWIOTLB API does not guarantee any specific > > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a > > perfectly valid bounce buffer address for this example. > > > > The caller may rightfully expect that the 16K granule containing the > > bounce buffer is not shared with any other user. For the above case I > > suggest to increase the allocation size to 0x4000 already in > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot > > address. > > That doesn't make sense - a caller asks to map some range of kernel > addresses and they get back a corresponding range of DMA addresses; they > cannot make any reasonable assumptions about DMA addresses *outside* > that range. In the example above, the caller has explicitly chosen not > to map the range xxx0000-xxx1234; if they expect the device to actually > access bytes in the DMA range yyy0000-yyy1234, then they should have > mapped the whole range starting from xxx0000 and it is their own error. I agree that the range was not requested. But it is not wrong if SWIOTLB overallocates. In fact, it usually does overallocate because it works with slot granularity. 
> SWIOTLB does not and cannot provide any memory protection itself, so > there is no functional benefit to automatically over-allocating, all it > will do is waste slots. iommu-dma *can* provide memory protection > between individual mappings via additional layers that SWIOTLB doesn't > know about, so in that case it's iommu-dma's responsibility to > explicitly manage whatever over-allocation is necessary at the SWIOTLB > level to match the IOMMU level.

I'm trying to understand what the caller expects to get when they request both buffer alignment (given implicitly through the mapping size, or explicitly with an alloc_align_mask) and a min_align_mask whose low bits overlap the buffer alignment and are non-zero in the original address.

In other words, if iommu_dma_map_page() gets into this situation:

* granule size is 4k
* device specifies 64k min_align_mask
* bit 11 of the original buffer address is non-zero

then you ask for a pair of slots where the first slot has bit 11 == 0 (required by alignment to granule size) and also has bit 11 == 1 (required to preserve the lowest 16 bits of the original address).

Sure, you can fail such a mapping, but is it what the caller expects?

Thanks
Petr T
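The contradiction described here reduces to a single bit test. A sketch with hypothetical names (not kernel code), assuming the returned address must be aligned to alloc_align_mask while also preserving the min_align_mask bits of orig_addr:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check (not kernel code): the two constraints cannot both
 * hold when a bit that min_align_mask requires to be copied from
 * orig_addr is non-zero yet lies inside the region that alignment to
 * alloc_align_mask forces to zero. */
bool align_constraints_conflict(uint64_t orig_addr,
				uint64_t min_align_mask,
				uint64_t alloc_align_mask)
{
	return (orig_addr & min_align_mask & alloc_align_mask) != 0;
}
```

With alloc_align_mask = 0xfff (4k granule), min_align_mask = 0xffff and bit 11 set in the original address, the check fires: exactly the pair-of-slots contradiction above.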
From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, March 1, 2024 10:42 AM > > On Fri, 1 Mar 2024 17:54:06 +0000 > Robin Murphy <robin.murphy@arm.com> wrote: > > > On 2024-03-01 5:08 pm, Petr Tesařík wrote: > > > On Fri, 1 Mar 2024 16:39:27 +0100 > > > Petr Tesařík <petr@tesarici.cz> wrote: > > > > > >> On Thu, 29 Feb 2024 16:47:56 +0100 > > >> Christoph Hellwig <hch@lst.de> wrote: > > >> > > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote: > > >>>> Any thoughts on how that historical behavior should apply if > > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask > > >>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently > > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is > > >>>>> = PAGE_SIZE. But a non-zero min_align_mask could mandate > > >>>> returning a buffer that is not page aligned. Perhaps do the > > >>>> historical behavior only if alloc_align_mask and min_align_mask > > >>>> are both zero? > > >>> > > >>> I think the driver setting min_align_mask is a clear indicator > > >>> that the driver requested a specific alignment and the defaults > > >>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu > > >>> I'd have to tak a closer look at how it is used. > > >> > > >> I'm not sure it helps in this discussion, but let me dive into a bit > > >> of ancient history to understand how we ended up here. > > >> > > >> IIRC this behaviour was originally motivated by limitations of PC AT > > >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow > > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM > > >> added a page register, but it was on a separate chip and it did not > > >> increment when the 8237 address rolled over back to zero. Effectively, > > >> the page register selected a 64K-aligned window of 64K buffers. > > >> Consequently, DMA buffers could not cross a 64K physical boundary. 
> > >> > > >> Thanks to how the buddy allocator works, the 64K-boundary constraint > > >> was satisfied by allocation size, and drivers took advantage of it when > > >> allocating device buffers. IMO software bounce buffers simply followed > > >> the same logic that worked for buffers allocated by the buddy allocator. > > >> > > >> OTOH min_align_mask was motivated by NVME which prescribes the value of > > >> a certain number of low bits in the DMA address (for simplicity assumed > > >> to be identical with the same bits in the physical address). > > >> > > >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and > > >> IIUC it is used to guarantee that unaligned transactions do not share > > >> the IOMMU granule with another device. This whole thing is weird, > > >> because swiotlb_tbl_map_single() is called like this: > > >> > > >> aligned_size = iova_align(iovad, size); > > >> phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size, > > >> iova_mask(iovad), dir, attrs); > > >> > > >> Here: > > >> > > >> * alloc_size = iova_align(iovad, size) > > >> * alloc_align_mask = iova_mask(iovad) > > >> > > >> Now, iova_align() rounds up its argument to a multiple of iova granule > > >> and iova_mask() is simply "granule - 1". This works, because granule > > >> size must be a power of 2, and I assume it must also be >= PAGE_SIZE. > > >> > > >> In that case, the alloc_align_mask argument is not even needed if you > > >> adjust the code to match documentation---the resulting buffer will be > > >> aligned to a granule boundary by virtue of having a size that is a > > >> multiple of the granule size. > > >> > > >> To sum it up: > > >> > > >> 1. min_align_mask is by far the most important constraint. Devices will > > >> simply stop working if it is not met. > > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or > > >> equal to the requested size has been documented, and some drivers > > >> may rely on it. > > >> 3. 
alloc_align_mask is a misguided fix for a bug in the above. > > >> > > >> Correct me if anything of the above is wrong. > > > > > > I thought about it some more, and I believe I know what should happen > > > if the first two constraints appear to be mutually exclusive. > > > > > > First, the alignment based on size does not guarantee that the resulting > > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must > > > be always identical to the original buffer address. > > > > > > Let's take an example request like this: > > > > > > TLB_SIZE = 0x00000800 > > > min_align_mask = 0x0000ffff > > > orig_addr = 0x....1234 > > > alloc_size = 0x00002800 > > > > > > Minimum alignment mask requires to keep the 1234 at the end. Allocation > > > size requires a buffer that is aligned to 16K. Of course, there is no > > > 16K-aligned slot with slot_address & 0x7ff == 0x200, but if IO_TLB_SHIFT > > > was 14, it would be slot_address & 0x3fff == 0 (low IO_TLB_SHIFT are > > > masked off). Since the SWIOTLB API does not guarantee any specific > > > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a > > > perfectly valid bounce buffer address for this example. > > > > > > The caller may rightfully expect that the 16K granule containing the > > > bounce buffer is not shared with any other user. For the above case I > > > suggest to increase the allocation size to 0x4000 already in > > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot > > > address. > > > > That doesn't make sense - a caller asks to map some range of kernel > > addresses and they get back a corresponding range of DMA addresses; they > > cannot make any reasonable assumptions about DMA addresses *outside* > > that range. 
In the example above, the caller has explicitly chosen not > > to map the range xxx0000-xxx1234; if they expect the device to actually > > access bytes in the DMA range yyy0000-yyy1234, then they should have > > mapped the whole range starting from xxx0000 and it is their own error. > > I agree that the range was not requested. But it is not wrong if > SWIOTLB overallocates. In fact, it usually does overallocate because it > works with slot granularity. > > > SWIOTLB does not and cannot provide any memory protection itself, so > > there is no functional benefit to automatically over-allocating, all it > > will do is waste slots. iommu-dma *can* provide memory protection > > between individual mappings via additional layers that SWIOTLB doesn't > > know about, so in that case it's iommu-dma's responsibility to > > explicitly manage whatever over-allocation is necessary at the SWIOTLB > > level to match the IOMMU level. > > I'm trying to understand what the caller expects to get if they request > both buffer alignment (either given implicitly through mapping size or > explicitly with an alloc_align_mask) with a min_align_mask and non-zero > low bits covered by the buffer alignment. > > In other words, if iommu_dma_map_page() gets into this situation: > > * granule size is 4k > * device specifies 64k min_align_mask > * bit 11 of the original buffer address is non-zero > > Then you ask for a pair of slots where the first slot has bit 11 == 0 > (required by alignment to granule size) and also has bit 11 == 1 > (required to preserve the lowest 16 bits of the original address). > > Sure, you can fail such a mapping, but is it what the caller expects? > Here's my take on tying all the threads together. There are four alignment combinations: 1. alloc_align_mask: zero; min_align_mask: zero 2. alloc_align_mask: zero; min_align_mask: non-zero 3. alloc_align_mask: non-zero; min_align_mask: zero/ignored 4. 
alloc_align_mask: non-zero; min_align_mask: non-zero

xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2 via swiotlb_map() and swiotlb_tbl_map_single(); iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single(); swiotlb_alloc() is #3, directly to swiotlb_find_slots().

For #1, the returned physical address has no constraints if the requested size is less than a page. For page size or greater, the discussed historical requirement for page alignment applies.

For #2, min_align_mask governs the bits of the returned physical address that must match the original address. When needed, swiotlb must also allocate pre-padding aligned to IO_TLB_SIZE that precedes the returned physical address. A request size <= swiotlb_max_mapping_size() will not exceed IO_TLB_SEGSIZE even with the padding. The historical requirement for page alignment does not apply because the driver has explicitly used the newer min_align_mask feature.

For #3, alloc_align_mask specifies the required alignment. No pre-padding is needed. Per earlier comments from Robin[1], it's reasonable to assume alloc_align_mask (i.e., the granule) is >= IO_TLB_SIZE. The original address is not relevant in determining the alignment, and the historical page alignment requirement does not apply since alloc_align_mask explicitly states the alignment.

For #4, the returned physical address must match the bits in the original address specified by min_align_mask. swiotlb must also allocate pre-padding aligned to alloc_align_mask that precedes the returned physical address. Also per Robin[1], assume alloc_align_mask is >= min_align_mask, which solves the conflicting alignment problem pointed out by Petr[2]. Perhaps we should add a "WARN_ON(alloc_align_mask < min_align_mask)" rather than failing depending on which bits of the original address are set. Again, the historical requirement for page alignment does not apply. 
I believe Will's patch set implements everything in #2, #3, and #4, except my suggested WARN_ON in #4. The historical page alignment in #1 presumably needs to be added. Also, the current implementation of #4 has a bug in that IO_TLB_SEGSIZE could be exceeded as pointed out here[3], but Robin was OK with not fixing that now. Michael [1] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#mbd31cbfbdf841336e25f37758c8af1a0b6d8f3eb [2] https://lore.kernel.org/linux-iommu/20240228133930.15400-1-will@kernel.org/T/#mf631679b302b1f5c7cacc82f4c15fb4b19f3dea1 [3] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#m4179a909777ec751f3dc15b515617477e6682600
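The four combinations above can be condensed into one hypothetical helper that picks the alignment the allocated slots must honour. This is a sketch of the decision table only; the function name and the 4 KiB page size are assumptions, and the pre-padding and IO_TLB_SEGSIZE details discussed above are deliberately ignored:

```c
#include <stdint.h>

#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1 << IO_TLB_SHIFT)
#define PAGE_SIZE_DEMO	4096		/* assumption: 4 KiB pages */

/* Hypothetical condensation of the four alignment combinations:
 * which mask must the allocated slots honour? */
uint64_t effective_align_mask(uint64_t alloc_size,
			      uint64_t alloc_align_mask,
			      uint64_t min_align_mask)
{
	if (alloc_align_mask)			/* combinations #3 and #4 */
		return alloc_align_mask;
	if (min_align_mask)			/* #2: slot-granular placement */
		return IO_TLB_SIZE - 1;
	if (alloc_size >= PAGE_SIZE_DEMO)	/* #1: historical page alignment */
		return PAGE_SIZE_DEMO - 1;
	return 0;				/* #1, small mapping: unconstrained */
}
```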
On Thu, Feb 29, 2024 at 06:07:32AM +0000, Michael Kelley wrote: > From: Will Deacon <will@kernel.org> Sent: Wednesday, February 28, 2024 5:40 AM > > > > For swiotlb allocations >= PAGE_SIZE, the slab search historically > > adjusted the stride to avoid checking unaligned slots. However, this is > > no longer needed now that the code around it has evolved and the > > stride is calculated from the required alignment. > > > > Either 'alloc_align_mask' is used to specify the allocation alignment or > > the DMA 'min_align_mask' is used to align the allocation with 'orig_addr'. > > At least one of these masks is always non-zero. > > I think the patch is correct, but this justification is not. alloc_align_mask > and the DMA min_align_mask are often both zero. While the NVMe > PCI driver sets min_align_mask, SCSI disk drivers do not (except for the > Hyper-V synthetic SCSI driver). When both are zero, presumably > there are no alignment requirements, so a stride of 1 is appropriate. Sorry, yes, I messed up the commit message here as I was trying to reason through the allocation case separately from the mapping case. However, I need to digest the rest of this thread before doing the obvious fix... Will
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c381a7ed718f..0d8805569f5e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1006,13 +1006,6 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
 	 */
 	stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
 
-	/*
-	 * For allocations of PAGE_SIZE or larger only look for page aligned
-	 * allocations.
-	 */
-	if (alloc_size >= PAGE_SIZE)
-		stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
-
 	spin_lock_irqsave(&area->lock, flags);
 	if (unlikely(nslots > pool->area_nslabs - area->used))
 		goto not_found;
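For reference, the stride the patch keeps is derived from the masks alone. A minimal model of get_max_slots() and the retained line (a sketch, with IO_TLB_SHIFT assumed to be 11, i.e. 2 KiB slots):

```c
#include <stdint.h>

#define IO_TLB_SHIFT	11	/* assumption: 2 KiB slots */

/* Minimal model of get_max_slots(): how many IO_TLB_SIZE slots a
 * power-of-2-minus-1 boundary mask spans. */
unsigned long get_max_slots_demo(unsigned long boundary_mask)
{
	return (boundary_mask >> IO_TLB_SHIFT) + 1;
}

/* The stride retained by the patch: the slot search probes only every
 * stride-th slot, so the alignment masks fully determine the step. */
unsigned long stride_demo(unsigned long alloc_align_mask,
			  unsigned long iotlb_align_mask)
{
	unsigned long m = alloc_align_mask > iotlb_align_mask ?
			  alloc_align_mask : iotlb_align_mask;
	return get_max_slots_demo(m);
}
```

With both masks zero the stride is 1, i.e. every slot is a candidate, which is why the commit message's claim that at least one mask is always non-zero drew the correction in the review above; a 4 KiB alloc_align_mask of 0xfff gives a stride of 2.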