[v1,2/2] nvme-pci: Fix iommu map (via swiotlb) failures when PAGE_SIZE=64KB

Message ID 60bdcc29a2bcf12c6ab95cf0ea480d67c41c51e7.1707851466.git.nicolinc@nvidia.com
State New
Series nvme-pci: Fix dma-iommu mapping failures when PAGE_SIZE=64KB

Commit Message

Nicolin Chen Feb. 13, 2024, 9:53 p.m. UTC
The min_align_mask is set for the swiotlb_max_mapping_size() calculation
used by dma_opt_mapping_size() a few lines below.

By default, swiotlb_max_mapping_size() returns 256KB minus
roundup(min_align_mask, IO_TLB_SIZE). Since NVME_CTRL_PAGE_SIZE=4KB, in
this case swiotlb_max_mapping_size() returns 252KB (256KB - 4KB).
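
For reference, the swiotlb side of that calculation looks roughly like
the below at the time of this series (paraphrased from
kernel/dma/swiotlb.c, where IO_TLB_SIZE is 2KB and IO_TLB_SEGSIZE is
128, giving the 256KB base):

    size_t swiotlb_max_mapping_size(struct device *dev)
    {
            int min_align_mask = dma_get_min_align_mask(dev);
            int min_align = 0;

            /*
             * swiotlb_find_slots() skips slots according to the min
             * align mask, which shrinks the usable mapping size.
             */
            if (min_align_mask)
                    min_align = roundup(min_align_mask, IO_TLB_SIZE);

            /* 128 slots * 2KB = 256KB, minus the alignment padding */
            return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
    }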

Meanwhile, the alloc_size eventually passed to swiotlb_tbl_map_single()
in dma-iommu is aligned to the attached domain's iova->granule. If that
domain (backed by an IO page table) uses a 64KB page size, alloc_size
can still ask for 256KB vs. the 252KB limit, which fails the mapping
request:
    systemd[1]: Started Journal Service.
 => nvme 0000:00:01.0: swiotlb buffer is full (sz: 262144 bytes), total 32768 (slots), used 128 (slots)
    note: journal-offline[392] exited with irqs disabled
    note: journal-offline[392] exited with preempt_count 1
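
On the dma-iommu side, the path that produces this alloc_size looks
roughly as below (a condensed sketch of the bounce-buffer branch of
iommu_dma_map_page() in drivers/iommu/dma-iommu.c; error handling
omitted):

    if (dev_use_swiotlb(dev, size, dir) &&
        iova_offset(iovad, phys | size)) {
            /*
             * Round the bounce buffer size up to the domain's IOVA
             * granule: with a 64KB granule, any request over 192KB
             * becomes a 256KB alloc_size, exceeding the 252KB limit.
             */
            size_t aligned_size = iova_align(iovad, size);

            phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
                                          iova_mask(iovad), dir, attrs);
            ...
    }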

Another factor is that the attached domain of an NVMe device can change,
and so can the value of iova->granule, meaning that the min_align_mask
cannot simply be set from the iova->granule value of the currently
attached domain.

Since iova->granule is usually the smallest size in the pgsize_bitmap
and often matches the CPU's PAGE_SIZE, simply set min_align_mask to
"PAGE_SIZE - 1".

Note that the other patch "iommu/dma: Force swiotlb_max_mapping_size on
an untrusted device" is required to entirely fix this mapping failure.

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Keith Busch Feb. 13, 2024, 11:31 p.m. UTC | #1
On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>  		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
>  	else
>  		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> -	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> +	dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
>  	dma_set_max_seg_size(&pdev->dev, 0xffffffff);

I recall we had to do this for POWER because they have 64k pages, but
the IOMMU maps page-aligned addresses at 4k granularity, so we needed
to allow the lower dma alignment to use it efficiently.
  
Nicolin Chen Feb. 14, 2024, 6:09 a.m. UTC | #2
On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> >       else
> >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > -     dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > +     dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> >       dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> 
> I recall we had to do this for POWER because they have 64k pages, but
> the IOMMU maps page-aligned addresses at 4k granularity, so we needed
> to allow the lower dma alignment to use it efficiently.

Thanks for the input!

In that case, we might have to rely on iovad->granule from the
attached iommu_domain:

+static size_t iommu_dma_max_mapping_size(struct device *dev)
+{
+       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+       if (!domain || !is_swiotlb_active(dev) || !dev_is_untrusted(dev))
+               return SIZE_MAX;
+       return ALIGN_DOWN(swiotlb_max_mapping_size(dev),
+                         domain->iova_cookie->iovad.granule);
+}
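
As a worked example, with a 64KB granule this would report
ALIGN_DOWN(252KB, 64KB) = 192KB, so any granule-aligned alloc_size
that dma-iommu produces stays within what swiotlb can satisfy.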

With this in PATCH-1, we can drop PATCH-2.

Thanks
Nicolin
  
Keith Busch Feb. 15, 2024, 1:36 a.m. UTC | #3
On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > >       else
> > >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > -     dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > +     dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > >       dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> > 
> > I recall we had to do this for POWER because they have 64k pages, but
> > the IOMMU maps page-aligned addresses at 4k granularity, so we needed
> > to allow the lower dma alignment to use it efficiently.
> 
> Thanks for the input!
> 
> In that case, we might have to rely on iovad->granule from the
> attached iommu_domain:

I explored a bit more, and there is some PPC weirdness that led to
NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
path. It looks like swiotlb is the only user of this, so your original
patch may be just fine.
  
Nicolin Chen Feb. 15, 2024, 4:46 a.m. UTC | #4
On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
> On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> > On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > > >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > > >       else
> > > >               dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > > -     dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > > +     dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > > >       dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> > >
> > > I recall we had to do this for POWER because they have 64k pages, but
> > > the IOMMU maps page-aligned addresses at 4k granularity, so we needed
> > > to allow the lower dma alignment to use it efficiently.
> >
> > Thanks for the input!
> >
> > In that case, we might have to rely on iovad->granule from the
> > attached iommu_domain:
> 
> I explored a bit more, and there is some PPC weirdness that led to
> NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
> path. It looks like swiotlb is the only user of this, so your original
> patch may be just fine.

Oh, that'll be great if we can confirm it. And I think I forgot to
add a CC line for the stable trees: the two patches should apply
cleanly to older kernels too. Let's wait a few days so people can
test and review, and then I will respin a v2 with the CC line.

Thanks
Nicolin
  
Robin Murphy Feb. 15, 2024, 12:01 p.m. UTC | #5
On 15/02/2024 4:46 am, Nicolin Chen wrote:
> On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
>> On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
>>> On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
>>>> On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
>>>>> @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>>>>>                dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
>>>>>        else
>>>>>                dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
>>>>> -     dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
>>>>> +     dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
>>>>>        dma_set_max_seg_size(&pdev->dev, 0xffffffff);
>>>>
>>>> I recall we had to do this for POWER because they have 64k pages, but
>>>> the IOMMU maps page-aligned addresses at 4k granularity, so we needed
>>>> to allow the lower dma alignment to use it efficiently.
>>>
>>> Thanks for the input!
>>>
>>> In that case, we might have to rely on iovad->granule from the
>>> attached iommu_domain:
>>
>> I explored a bit more, and there is some PPC weirdness that led to
>> NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
>> path. It looks like swiotlb is the only user of this, so your original
>> patch may be just fine.
> 
> Oh, that'll be great if we can confirm it. And I think I forgot to
> add a CC line for the stable trees: the two patches should apply
> cleanly to older kernels too. Let's wait a few days so people can
> test and review, and then I will respin a v2 with the CC line.

Hmm, as far as I understand, NVME_CTRL_PAGE_SIZE represents the 
alignment that NVMe actually cares about, so if specifying that per the 
intended purpose of the API doesn't work then it implies the DMA layer 
is still not doing its job properly, thus I'd rather keep digging and 
try to fix that properly.

FWIW I have a strong suspicion that iommu-dma may not be correctly doing 
what it thinks it's trying to do, so I would definitely think it 
worthwhile to give that a really close inspection in light of Will's 
SWIOTLB fixes.

Thanks,
Robin.
  
Nicolin Chen Feb. 16, 2024, 1:07 a.m. UTC | #6
On Thu, Feb 15, 2024 at 12:01:34PM +0000, Robin Murphy wrote:
> On 15/02/2024 4:46 am, Nicolin Chen wrote:
> > On Wed, Feb 14, 2024 at 06:36:38PM -0700, Keith Busch wrote:
> > > On Tue, Feb 13, 2024 at 10:09:19PM -0800, Nicolin Chen wrote:
> > > > On Tue, Feb 13, 2024 at 04:31:04PM -0700, Keith Busch wrote:
> > > > > On Tue, Feb 13, 2024 at 01:53:57PM -0800, Nicolin Chen wrote:
> > > > > > @@ -2967,7 +2967,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> > > > > >                dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
> > > > > >        else
> > > > > >                dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
> > > > > > -     dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
> > > > > > +     dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
> > > > > >        dma_set_max_seg_size(&pdev->dev, 0xffffffff);
> > > > > 
> > > > > I recall we had to do this for POWER because they have 64k pages, but
> > > > > the IOMMU maps page-aligned addresses at 4k granularity, so we needed
> > > > > to allow the lower dma alignment to use it efficiently.
> > > > 
> > > > Thanks for the input!
> > > > 
> > > > In that case, we might have to rely on iovad->granule from the
> > > > attached iommu_domain:
> > > 
> > > I explored a bit more, and there is some PPC weirdness that led to
> > > NVME_CTRL_PAGE_SIZE; I don't find the dma min align mask used in that
> > > path. It looks like swiotlb is the only user of this, so your original
> > > patch may be just fine.
> > 
> > Oh, that'll be great if we can confirm it. And I think I forgot to
> > add a CC line for the stable trees: the two patches should apply
> > cleanly to older kernels too. Let's wait a few days so people can
> > test and review, and then I will respin a v2 with the CC line.
> 
> Hmm, as far as I understand, NVME_CTRL_PAGE_SIZE represents the
> alignment that NVMe actually cares about, so if specifying that per the
> intended purpose of the API doesn't work then it implies the DMA layer
> is still not doing its job properly, thus I'd rather keep digging and
> try to fix that properly.
>
> FWIW I have a strong suspicion that iommu-dma may not be correctly doing
> what it thinks it's trying to do, so I would definitely think it
> worthwhile to give that a really close inspection in light of Will's
> SWIOTLB fixes.

Yes. Let's figure out what's breaking Will's change.

Thanks
Nicolin
  

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e6267a6aa380..ad41fe0bf2e3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2967,7 +2967,7 @@  static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48));
 	else
 		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
-	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);
+	dma_set_min_align_mask(&pdev->dev, PAGE_SIZE - 1);
 	dma_set_max_seg_size(&pdev->dev, 0xffffffff);
 
 	/*