Revert "arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()"

Message ID 20221114110329.68413-1-manivannan.sadhasivam@linaro.org
State New
Headers
Series Revert "arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()" |

Commit Message

Manivannan Sadhasivam Nov. 14, 2022, 11:03 a.m. UTC
  This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.

As reported by Amit [1], dropping cache invalidation from
arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
(most probably on other Qcom platforms too). The reason is, Qcom
qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
for validation. The modem has a secure block (XPU) that will trigger a
whole system crash if the shared memory is accessed by the CPU while modem
is poking at it.

To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
with no kernel mapping, vmap's it, copies the firmware metadata and
unvmap's it. Finally the address is then shared with modem for metadata
validation [2].

Now because of the removal of cache invalidation from
arch_dma_prep_coherent(), there will be cache lines associated with this
memory even after sharing with modem. So when the CPU accesses it, the XPU
violation gets triggered.

So let's revert this commit to get remoteproc's working (thereby avoiding
full system crash) on Qcom platforms.

[1] https://lore.kernel.org/linux-arm-kernel/CAMi1Hd1VBCFhf7+EXWHQWcGy4k=tcyLa7RGiFdprtRnegSG0Mw@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c#n933

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---

Will, Catalin: Please share if you have any other suggestions to handle the
resource sharing in the remoteproc driver that could avoid this revert.

 arch/arm64/mm/dma-mapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Manivannan Sadhasivam Nov. 14, 2022, 11:29 a.m. UTC | #1
On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> 
> As reported by Amit [1], dropping cache invalidation from
> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform

s/SM8250/SDM845/g

Sorry for the confusion.

Thanks,
Mani

> (most probably on other Qcom platforms too). The reason is, Qcom
> qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> for validation. The modem has a secure block (XPU) that will trigger a
> whole system crash if the shared memory is accessed by the CPU while modem
> is poking at it.
> 
> To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> with no kernel mapping, vmap's it, copies the firmware metadata and
> unvmap's it. Finally the address is then shared with modem for metadata
> validation [2].
> 
> Now because of the removal of cache invalidation from
> arch_dma_prep_coherent(), there will be cache lines associated with this
> memory even after sharing with modem. So when the CPU accesses it, the XPU
> violation gets triggered.
> 
> So let's revert this commit to get remoteproc's working (thereby avoiding
> full system crash) on Qcom platforms.
> 
> [1] https://lore.kernel.org/linux-arm-kernel/CAMi1Hd1VBCFhf7+EXWHQWcGy4k=tcyLa7RGiFdprtRnegSG0Mw@mail.gmail.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c#n933
> 
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
> 
> Will, Catalin: Please share if you have any other suggestions to handle the
> resource sharing in the remoteproc driver that could avoid this revert.
> 
>  arch/arm64/mm/dma-mapping.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 3cb101e8cb29..7d7e9a046305 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -36,7 +36,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
>  {
>  	unsigned long start = (unsigned long)page_address(page);
>  
> -	dcache_clean_poc(start, start + size);
> +	dcache_clean_inval_poc(start, start + size);
>  }
>  
>  #ifdef CONFIG_IOMMU_DMA
> -- 
> 2.25.1
>
  
Will Deacon Nov. 14, 2022, 2:11 p.m. UTC | #2
On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> 
> As reported by Amit [1], dropping cache invalidation from
> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> (most probably on other Qcom platforms too). The reason is, Qcom
> qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> for validation. The modem has a secure block (XPU) that will trigger a
> whole system crash if the shared memory is accessed by the CPU while modem
> is poking at it.
> 
> To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> with no kernel mapping, vmap's it, copies the firmware metadata and
> unvmap's it. Finally the address is then shared with modem for metadata
> validation [2].
> 
> Now because of the removal of cache invalidation from
> arch_dma_prep_coherent(), there will be cache lines associated with this
> memory even after sharing with modem. So when the CPU accesses it, the XPU
> violation gets triggered.

This last past is a non-sequitur: the buffer is no longer mapped on the CPU
side, so how would the CPU access it?

As I just replied to Amit, we need more information about what this
"access" is and how it is being detected.

Will
  
Robin Murphy Nov. 14, 2022, 3:14 p.m. UTC | #3
On 2022-11-14 14:11, Will Deacon wrote:
> On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
>> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
>>
>> As reported by Amit [1], dropping cache invalidation from
>> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
>> (most probably on other Qcom platforms too). The reason is, Qcom
>> qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
>> for validation. The modem has a secure block (XPU) that will trigger a
>> whole system crash if the shared memory is accessed by the CPU while modem
>> is poking at it.
>>
>> To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
>> with no kernel mapping, vmap's it, copies the firmware metadata and
>> unvmap's it. Finally the address is then shared with modem for metadata
>> validation [2].
>>
>> Now because of the removal of cache invalidation from
>> arch_dma_prep_coherent(), there will be cache lines associated with this
>> memory even after sharing with modem. So when the CPU accesses it, the XPU
>> violation gets triggered.
> 
> This last past is a non-sequitur: the buffer is no longer mapped on the CPU
> side, so how would the CPU access it?

Right, for the previous change to have made a difference the offending 
part of this buffer must be present in some cache somewhere *before* the 
DMA buffer allocation completes.

Clearly that driver is completely broken though. If the DMA allocation 
came from a no-map carveout vma_dma_alloc_from_dev_coherent() then the 
vmap() shenanigans wouldn't work, so if it backed by struct pages then 
the whole dance is still pointless because *a cacheable linear mapping 
exists*, and it's just relying on the reduced chance that anything's 
going to re-fetch the linear map address after those pages have been 
allocated, exactly as I called out previously[1].

Robin.

[1] 
https://lore.kernel.org/linux-arm-kernel/97fface8-e40e-072c-4335-c94094884e93@arm.com/

> As I just replied to Amit, we need more information about what this
> "access" is and how it is being detected.
> 
> Will
  
Catalin Marinas Nov. 14, 2022, 5:38 p.m. UTC | #4
On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
> On 2022-11-14 14:11, Will Deacon wrote:
> > On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> > > This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> > > 
> > > As reported by Amit [1], dropping cache invalidation from
> > > arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> > > (most probably on other Qcom platforms too). The reason is, Qcom
> > > qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> > > for validation. The modem has a secure block (XPU) that will trigger a
> > > whole system crash if the shared memory is accessed by the CPU while modem
> > > is poking at it.
> > > 
> > > To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> > > with no kernel mapping, vmap's it, copies the firmware metadata and
> > > unvmap's it. Finally the address is then shared with modem for metadata
> > > validation [2].
> > > 
> > > Now because of the removal of cache invalidation from
> > > arch_dma_prep_coherent(), there will be cache lines associated with this
> > > memory even after sharing with modem. So when the CPU accesses it, the XPU
> > > violation gets triggered.
> > 
> > This last past is a non-sequitur: the buffer is no longer mapped on the CPU
> > side, so how would the CPU access it?
> 
> Right, for the previous change to have made a difference the offending part
> of this buffer must be present in some cache somewhere *before* the DMA
> buffer allocation completes.
> 
> Clearly that driver is completely broken though. If the DMA allocation came
> from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
> shenanigans wouldn't work, so if it backed by struct pages then the whole
> dance is still pointless because *a cacheable linear mapping exists*, and
> it's just relying on the reduced chance that anything's going to re-fetch
> the linear map address after those pages have been allocated, exactly as I
> called out previously[1].

So I guess a DMA pool that's not mapped in the linear map, together with
memremap() instead of vmap(), would work around the issue. But the
driver needs fixing, not the arch code.
  
Manivannan Sadhasivam Nov. 18, 2022, 10:54 a.m. UTC | #5
On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
> On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
> > On 2022-11-14 14:11, Will Deacon wrote:
> > > On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> > > > This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> > > > 
> > > > As reported by Amit [1], dropping cache invalidation from
> > > > arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> > > > (most probably on other Qcom platforms too). The reason is, Qcom
> > > > qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> > > > for validation. The modem has a secure block (XPU) that will trigger a
> > > > whole system crash if the shared memory is accessed by the CPU while modem
> > > > is poking at it.
> > > > 
> > > > To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> > > > with no kernel mapping, vmap's it, copies the firmware metadata and
> > > > unvmap's it. Finally the address is then shared with modem for metadata
> > > > validation [2].
> > > > 
> > > > Now because of the removal of cache invalidation from
> > > > arch_dma_prep_coherent(), there will be cache lines associated with this
> > > > memory even after sharing with modem. So when the CPU accesses it, the XPU
> > > > violation gets triggered.
> > > 
> > > This last past is a non-sequitur: the buffer is no longer mapped on the CPU
> > > side, so how would the CPU access it?
> > 
> > Right, for the previous change to have made a difference the offending part
> > of this buffer must be present in some cache somewhere *before* the DMA
> > buffer allocation completes.
> > 
> > Clearly that driver is completely broken though. If the DMA allocation came
> > from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
> > shenanigans wouldn't work, so if it backed by struct pages then the whole
> > dance is still pointless because *a cacheable linear mapping exists*, and
> > it's just relying on the reduced chance that anything's going to re-fetch
> > the linear map address after those pages have been allocated, exactly as I
> > called out previously[1].
> 
> So I guess a DMA pool that's not mapped in the linear map, together with
> memremap() instead of vmap(), would work around the issue. But the
> driver needs fixing, not the arch code.
> 

Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
not part of the kernel's linear map? I looked into it but couldn't find a way.

Thanks,
Mani

> -- 
> Catalin
  
Will Deacon Nov. 18, 2022, 12:33 p.m. UTC | #6
On Fri, Nov 18, 2022 at 04:24:02PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
> > On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
> > > On 2022-11-14 14:11, Will Deacon wrote:
> > > > On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> > > > > This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> > > > > 
> > > > > As reported by Amit [1], dropping cache invalidation from
> > > > > arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> > > > > (most probably on other Qcom platforms too). The reason is, Qcom
> > > > > qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> > > > > for validation. The modem has a secure block (XPU) that will trigger a
> > > > > whole system crash if the shared memory is accessed by the CPU while modem
> > > > > is poking at it.
> > > > > 
> > > > > To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> > > > > with no kernel mapping, vmap's it, copies the firmware metadata and
> > > > > unvmap's it. Finally the address is then shared with modem for metadata
> > > > > validation [2].
> > > > > 
> > > > > Now because of the removal of cache invalidation from
> > > > > arch_dma_prep_coherent(), there will be cache lines associated with this
> > > > > memory even after sharing with modem. So when the CPU accesses it, the XPU
> > > > > violation gets triggered.
> > > > 
> > > > This last past is a non-sequitur: the buffer is no longer mapped on the CPU
> > > > side, so how would the CPU access it?
> > > 
> > > Right, for the previous change to have made a difference the offending part
> > > of this buffer must be present in some cache somewhere *before* the DMA
> > > buffer allocation completes.
> > > 
> > > Clearly that driver is completely broken though. If the DMA allocation came
> > > from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
> > > shenanigans wouldn't work, so if it backed by struct pages then the whole
> > > dance is still pointless because *a cacheable linear mapping exists*, and
> > > it's just relying on the reduced chance that anything's going to re-fetch
> > > the linear map address after those pages have been allocated, exactly as I
> > > called out previously[1].
> > 
> > So I guess a DMA pool that's not mapped in the linear map, together with
> > memremap() instead of vmap(), would work around the issue. But the
> > driver needs fixing, not the arch code.
> > 
> 
> Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
> not part of the kernel's linear map? I looked into it but couldn't find a way.

The no-map property should take care of this iirc

Will
  
Manivannan Sadhasivam Nov. 21, 2022, 6:42 a.m. UTC | #7
On Fri, Nov 18, 2022 at 12:33:49PM +0000, Will Deacon wrote:
> On Fri, Nov 18, 2022 at 04:24:02PM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
> > > On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
> > > > On 2022-11-14 14:11, Will Deacon wrote:
> > > > > On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
> > > > > > This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> > > > > > 
> > > > > > As reported by Amit [1], dropping cache invalidation from
> > > > > > arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> > > > > > (most probably on other Qcom platforms too). The reason is, Qcom
> > > > > > qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> > > > > > for validation. The modem has a secure block (XPU) that will trigger a
> > > > > > whole system crash if the shared memory is accessed by the CPU while modem
> > > > > > is poking at it.
> > > > > > 
> > > > > > To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
> > > > > > with no kernel mapping, vmap's it, copies the firmware metadata and
> > > > > > unvmap's it. Finally the address is then shared with modem for metadata
> > > > > > validation [2].
> > > > > > 
> > > > > > Now because of the removal of cache invalidation from
> > > > > > arch_dma_prep_coherent(), there will be cache lines associated with this
> > > > > > memory even after sharing with modem. So when the CPU accesses it, the XPU
> > > > > > violation gets triggered.
> > > > > 
> > > > > This last past is a non-sequitur: the buffer is no longer mapped on the CPU
> > > > > side, so how would the CPU access it?
> > > > 
> > > > Right, for the previous change to have made a difference the offending part
> > > > of this buffer must be present in some cache somewhere *before* the DMA
> > > > buffer allocation completes.
> > > > 
> > > > Clearly that driver is completely broken though. If the DMA allocation came
> > > > from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
> > > > shenanigans wouldn't work, so if it backed by struct pages then the whole
> > > > dance is still pointless because *a cacheable linear mapping exists*, and
> > > > it's just relying on the reduced chance that anything's going to re-fetch
> > > > the linear map address after those pages have been allocated, exactly as I
> > > > called out previously[1].
> > > 
> > > So I guess a DMA pool that's not mapped in the linear map, together with
> > > memremap() instead of vmap(), would work around the issue. But the
> > > driver needs fixing, not the arch code.
> > > 
> > 
> > Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
> > not part of the kernel's linear map? I looked into it but couldn't find a way.
> 
> The no-map property should take care of this iirc
> 

Yeah, we have been using it in other places of the same driver. But as per
Sibi, we used dynamic allocation for metadata validation since there was no
memory reserved statically for that.

But if we do not have a way to allocate a dynamic memory that is not part of
kernel's linear map, then we may have to resort to using an existing reserved
memory.

Thanks,
Mani

> Will
  
Sibi Sankar Nov. 21, 2022, 10:12 a.m. UTC | #8
On 11/21/22 12:12, Manivannan Sadhasivam wrote:
> On Fri, Nov 18, 2022 at 12:33:49PM +0000, Will Deacon wrote:
>> On Fri, Nov 18, 2022 at 04:24:02PM +0530, Manivannan Sadhasivam wrote:
>>> On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
>>>> On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
>>>>> On 2022-11-14 14:11, Will Deacon wrote:
>>>>>> On Mon, Nov 14, 2022 at 04:33:29PM +0530, Manivannan Sadhasivam wrote:
>>>>>>> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
>>>>>>>
>>>>>>> As reported by Amit [1], dropping cache invalidation from
>>>>>>> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
>>>>>>> (most probably on other Qcom platforms too). The reason is, Qcom
>>>>>>> qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
>>>>>>> for validation. The modem has a secure block (XPU) that will trigger a
>>>>>>> whole system crash if the shared memory is accessed by the CPU while modem
>>>>>>> is poking at it.
>>>>>>>
>>>>>>> To avoid this issue, the qcom_q6v5_mss driver allocates a chunk of memory
>>>>>>> with no kernel mapping, vmap's it, copies the firmware metadata and
>>>>>>> unvmap's it. Finally the address is then shared with modem for metadata
>>>>>>> validation [2].
>>>>>>>
>>>>>>> Now because of the removal of cache invalidation from
>>>>>>> arch_dma_prep_coherent(), there will be cache lines associated with this
>>>>>>> memory even after sharing with modem. So when the CPU accesses it, the XPU
>>>>>>> violation gets triggered.
>>>>>>
>>>>>> This last past is a non-sequitur: the buffer is no longer mapped on the CPU
>>>>>> side, so how would the CPU access it?
>>>>>
>>>>> Right, for the previous change to have made a difference the offending part
>>>>> of this buffer must be present in some cache somewhere *before* the DMA
>>>>> buffer allocation completes.
>>>>>
>>>>> Clearly that driver is completely broken though. If the DMA allocation came
>>>>> from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
>>>>> shenanigans wouldn't work, so if it backed by struct pages then the whole
>>>>> dance is still pointless because *a cacheable linear mapping exists*, and
>>>>> it's just relying on the reduced chance that anything's going to re-fetch
>>>>> the linear map address after those pages have been allocated, exactly as I
>>>>> called out previously[1].
>>>>
>>>> So I guess a DMA pool that's not mapped in the linear map, together with
>>>> memremap() instead of vmap(), would work around the issue. But the
>>>> driver needs fixing, not the arch code.
>>>>
>>>
>>> Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
>>> not part of the kernel's linear map? I looked into it but couldn't find a way.
>>
>> The no-map property should take care of this iirc
>>
> 
> Yeah, we have been using it in other places of the same driver. But as per
> Sibi, we used dynamic allocation for metadata validation since there was no
> memory reserved statically for that.

Will,

Unlike the other portions in the driver that required statically defined 
no-map carveouts, metadata just needed a contiguous memory for 
authentication. Re-using existing carveouts for this metadata region
may not work due to modem FW limitations and declaring a new carveout 
for metadata will break the device tree bindings. That's the reason for
using DMA_ATTR_NO_KERNEL_MAPPING for dma_alloc_attr and vmpa/vunmap with
VM_FLUSH_RESET_PERMS before passing the memory onto modem. Are there 
other suggestions for achieving the same without breaking bindings?

- Sibi

> 
> But if we do not have a way to allocate a dynamic memory that is not part of
> kernel's linear map, then we may have to resort to using an existing reserved
> memory.
> 
> Thanks,
> Mani
> 
>> Will
>
  
Catalin Marinas Nov. 24, 2022, 11:55 a.m. UTC | #9
On Mon, Nov 21, 2022 at 03:42:27PM +0530, Sibi Sankar wrote:
> On 11/21/22 12:12, Manivannan Sadhasivam wrote:
> > On Fri, Nov 18, 2022 at 12:33:49PM +0000, Will Deacon wrote:
> > > On Fri, Nov 18, 2022 at 04:24:02PM +0530, Manivannan Sadhasivam wrote:
> > > > On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
> > > > > On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
> > > > > > Clearly that driver is completely broken though. If the DMA allocation came
> > > > > > from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
> > > > > > shenanigans wouldn't work, so if it backed by struct pages then the whole
> > > > > > dance is still pointless because *a cacheable linear mapping exists*, and
> > > > > > it's just relying on the reduced chance that anything's going to re-fetch
> > > > > > the linear map address after those pages have been allocated, exactly as I
> > > > > > called out previously[1].
> > > > > 
> > > > > So I guess a DMA pool that's not mapped in the linear map, together with
> > > > > memremap() instead of vmap(), would work around the issue. But the
> > > > > driver needs fixing, not the arch code.
> > > > 
> > > > Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
> > > > not part of the kernel's linear map? I looked into it but couldn't find a way.
> > > 
> > > The no-map property should take care of this iirc
> > 
> > Yeah, we have been using it in other places of the same driver. But as per
> > Sibi, we used dynamic allocation for metadata validation since there was no
> > memory reserved statically for that.
> 
> Unlike the other portions in the driver that required statically defined
> no-map carveouts, metadata just needed a contiguous memory for
> authentication. Re-using existing carveouts for this metadata region
> may not work due to modem FW limitations and declaring a new carveout for
> metadata will break the device tree bindings. That's the reason for
> using DMA_ATTR_NO_KERNEL_MAPPING for dma_alloc_attr and vmpa/vunmap with
> VM_FLUSH_RESET_PERMS before passing the memory onto modem. Are there other
> suggestions for achieving the same without breaking bindings?

Your DMA_ATTR_NO_KERNEL_MAPPING workaround doesn't work, it only makes
the failure rate smaller. All this attribute does is avoiding creating a
non-cacheable mapping but you still have the kernel linear mapping in
place that may be speculatively accessed by the CPU. You were just lucky
so far not to have hit the issue. So I'd rather see this fixed properly
with a no-map carveout. Maybe you can reuse an existing carveout if the
driver already needs some and avoid changing the DT. More complicated
options include allocating memory and unmapping it from the linear map
with set_memory_valid(), though that's not exported to modules and it
also requires the linear map to be pages only, not block mappings.

Yet another option is to have the swiotlb buffer unmapped from the
kernel linear map and use the bounce buffer for this. That's more
involved (Robin has some patches, though for a different reason and they
may not make it upstream).
  
Linux regression tracking (Thorsten Leemhuis) Nov. 28, 2022, 5:44 a.m. UTC | #10
On 14.11.22 12:03, Manivannan Sadhasivam wrote:
> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> 
> As reported by Amit [1], dropping cache invalidation from
> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> (most probably on other Qcom platforms too). The reason is, Qcom
> qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> for validation. The modem has a secure block (XPU) that will trigger a
> whole system crash if the shared memory is accessed by the CPU while modem
> is poking at it.
> [...]
> [1] https://lore.kernel.org/linux-arm-kernel/CAMi1Hd1VBCFhf7+EXWHQWcGy4k=tcyLa7RGiFdprtRnegSG0Mw@mail.gmail.com/
>

I have Amit's report on the list of tracked regressions. I also noticed
the proposed change "arm64: dts: qcom: sc8280xp: fix PCIe DMA coherency":
https://lore.kernel.org/all/20221124142501.29314-1-johan+linaro@kernel.org/

I have no expertise in this area, but it looked somewhat related to my
eyes, so please allow me to quickly ask: is that related or even
supposed to fix Amit's regression?

Ciao, Thorsten
  
Manivannan Sadhasivam Nov. 28, 2022, 8:15 a.m. UTC | #11
On Mon, Nov 28, 2022 at 06:44:13AM +0100, Thorsten Leemhuis wrote:
> On 14.11.22 12:03, Manivannan Sadhasivam wrote:
> > This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
> > 
> > As reported by Amit [1], dropping cache invalidation from
> > arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> > (most probably on other Qcom platforms too). The reason is, Qcom
> > qcom_q6v5_mss driver copies the firmware metadata and shares it with modem
> > for validation. The modem has a secure block (XPU) that will trigger a
> > whole system crash if the shared memory is accessed by the CPU while modem
> > is poking at it.
> > [...]
> > [1] https://lore.kernel.org/linux-arm-kernel/CAMi1Hd1VBCFhf7+EXWHQWcGy4k=tcyLa7RGiFdprtRnegSG0Mw@mail.gmail.com/
> >
> 
> I have Amit's report on the list of tracked regressions. I also noticed
> the proposed change "arm64: dts: qcom: sc8280xp: fix PCIe DMA coherency":
> https://lore.kernel.org/all/20221124142501.29314-1-johan+linaro@kernel.org/
> 
> I have no expertise in this area, but it looked somewhat related to my
> eyes, so please allow me to quickly ask: is that related or even
> supposed to fix Amit's regression?
> 

The proposed patch doesn't fix the regression reported by Amit. But the patch
itself fixes an issue that might be triggered more frequently by c44094eee32f.

Thanks,
Mani

> Ciao, Thorsten
  
Linux regression tracking (Thorsten Leemhuis) Dec. 1, 2022, 9:29 a.m. UTC | #12
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Has any progress been made to fix this regression? It afaics is not a
release critical issue, but well, it still would be nice to get this
fixed before 6.1 is released.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

On 24.11.22 12:55, Catalin Marinas wrote:
> On Mon, Nov 21, 2022 at 03:42:27PM +0530, Sibi Sankar wrote:
>> On 11/21/22 12:12, Manivannan Sadhasivam wrote:
>>> On Fri, Nov 18, 2022 at 12:33:49PM +0000, Will Deacon wrote:
>>>> On Fri, Nov 18, 2022 at 04:24:02PM +0530, Manivannan Sadhasivam wrote:
>>>>> On Mon, Nov 14, 2022 at 05:38:00PM +0000, Catalin Marinas wrote:
>>>>>> On Mon, Nov 14, 2022 at 03:14:21PM +0000, Robin Murphy wrote:
>>>>>>> Clearly that driver is completely broken though. If the DMA allocation came
>>>>>>> from a no-map carveout vma_dma_alloc_from_dev_coherent() then the vmap()
>>>>>>> shenanigans wouldn't work, so if it backed by struct pages then the whole
>>>>>>> dance is still pointless because *a cacheable linear mapping exists*, and
>>>>>>> it's just relying on the reduced chance that anything's going to re-fetch
>>>>>>> the linear map address after those pages have been allocated, exactly as I
>>>>>>> called out previously[1].
>>>>>>
>>>>>> So I guess a DMA pool that's not mapped in the linear map, together with
>>>>>> memremap() instead of vmap(), would work around the issue. But the
>>>>>> driver needs fixing, not the arch code.
>>>>>
>>>>> Okay, thanks for the hint. Can you share how to allocate the dma-pool that's
>>>>> not part of the kernel's linear map? I looked into it but couldn't find a way.
>>>>
>>>> The no-map property should take care of this iirc
>>>
>>> Yeah, we have been using it in other places of the same driver. But as per
>>> Sibi, we used dynamic allocation for metadata validation since there was no
>>> memory reserved statically for that.
>>
>> Unlike the other portions in the driver that required statically defined
>> no-map carveouts, metadata just needed a contiguous memory for
>> authentication. Re-using existing carveouts for this metadata region
>> may not work due to modem FW limitations and declaring a new carveout for
>> metadata will break the device tree bindings. That's the reason for
>> using DMA_ATTR_NO_KERNEL_MAPPING for dma_alloc_attr and vmpa/vunmap with
>> VM_FLUSH_RESET_PERMS before passing the memory onto modem. Are there other
>> suggestions for achieving the same without breaking bindings?
> 
> Your DMA_ATTR_NO_KERNEL_MAPPING workaround doesn't work, it only makes
> the failure rate smaller. All this attribute does is avoiding creating a
> non-cacheable mapping but you still have the kernel linear mapping in
> place that may be speculatively accessed by the CPU. You were just lucky
> so far not to have hit the issue. So I'd rather see this fixed properly
> with a no-map carveout. Maybe you can reuse an existing carveout if the
> driver already needs some and avoid changing the DT. More complicated
> options include allocating memory and unmapping it from the linear map
> with set_memory_valid(), though that's not exported to modules and it
> also requires the linear map to be pages only, not block mappings.
> 
> Yet another option is to have the swiotlb buffer unmapped from the
> kernel linear map and use the bounce buffer for this. That's more
> involved (Robin has some patches, though for a different reason and they
> may not make it upstream).
>
  
Catalin Marinas Dec. 1, 2022, 5:45 p.m. UTC | #13
On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> Has any progress been made to fix this regression? It afaics is not a
> release critical issue, but well, it still would be nice to get this
> fixed before 6.1 is released.

The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
that exposed the driver bug. It doesn't fix the actual bug, it only
makes it less likely to happen.

I like the original commit removing the cache invalidation as it shows
drivers not behaving properly but, as a workaround, we could add a
command line option to force back the old behaviour (defaulting to the
new one) until the driver is fixed.
  
Amit Pundir Dec. 2, 2022, 8:26 a.m. UTC | #14
On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > Has any progress been made to fix this regression? It afaics is not a
> > release critical issue, but well, it still would be nice to get this
> > fixed before 6.1 is released.
>
> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> that exposed the driver bug. It doesn't fix the actual bug, it only
> makes it less likely to happen.
>
> I like the original commit removing the cache invalidation as it shows
> drivers not behaving properly but, as a workaround, we could add a
> command line option to force back the old behaviour (defaulting to the
> new one) until the driver is fixed.

We use DB845c extensively for mainline and android-mainline[1] testing
with AOSP, and it is broken for weeks now. So be it a temporary
workaround or a proper driver fix in place, we'd really appreciate a
quick fix here.

I understand that the revert doesn't fix the actual driver bug, but we
were very very lucky so far that we had never hit this issue before.
So at this point I'll take the revert of the upstream commit as well,
while a proper fix is being worked upon.

Regards,
Amit Pundir
[1] https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline




>
> --
> Catalin
  
Linux regression tracking (Thorsten Leemhuis) Dec. 2, 2022, 8:54 a.m. UTC | #15
On 02.12.22 09:26, Amit Pundir wrote:
> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>
>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
>>> Has any progress been made to fix this regression? It afaics is not a
>>> release critical issue, but well, it still would be nice to get this
>>> fixed before 6.1 is released.
>>
>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
>> that exposed the driver bug. It doesn't fix the actual bug, it only
>> makes it less likely to happen.
>>
>> I like the original commit removing the cache invalidation as it shows
>> drivers not behaving properly

Yeah, I understand that, but I guess it's my job to ask at this point:
"is continuing to live with the old behavior for one or two more cycles"
that much of a problem"?

>> but, as a workaround, we could add a
>> command line option to force back the old behaviour (defaulting to the
>> new one) until the driver is fixed.

Well, sometimes that approach is fine to fix a regression, but I'm not
sure this is one of those situations, as this...

> We use DB845c extensively for mainline and android-mainline[1] testing
> with AOSP, and it is broken for weeks now. So be it a temporary
> workaround or a proper driver fix in place, we'd really appreciate a
> quick fix here.

...doesn't sound like we are not talking about some odd corner case
here. But in the end that would be up to Linus to decide.

I'll point him to this thread once more in my weekly report anyway.
Maybe I'll even suggest to revert this change, not sure yet.

> [...]

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
  
Will Deacon Dec. 2, 2022, 10:03 a.m. UTC | #16
On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> On 02.12.22 09:26, Amit Pundir wrote:
> > On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>
> >> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> >>> Has any progress been made to fix this regression? It afaics is not a
> >>> release critical issue, but well, it still would be nice to get this
> >>> fixed before 6.1 is released.
> >>
> >> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> >> that exposed the driver bug. It doesn't fix the actual bug, it only
> >> makes it less likely to happen.
> >>
> >> I like the original commit removing the cache invalidation as it shows
> >> drivers not behaving properly
> 
> Yeah, I understand that, but I guess it's my job to ask at this point:
> "is continuing to live with the old behavior for one or two more cycles"
> that much of a problem"?

That wouldn't be a problem. The problem is that I haven't see any efforts
from the Qualcomm side to actually fix the drivers so what makes you think
the issue will be addressed in one or two more cycles? On the other hand, if
there were patches out there trying to fix the drivers and we just needed to
revert this change to buy them some time, then that would obviously be the
right thing to do.

> >> but, as a workaround, we could add a
> >> command line option to force back the old behaviour (defaulting to the
> >> new one) until the driver is fixed.
> 
> Well, sometimes that approach is fine to fix a regression, but I'm not
> sure this is one of those situations, as this...
> 
> > We use DB845c extensively for mainline and android-mainline[1] testing
> > with AOSP, and it is broken for weeks now. So be it a temporary
> > workaround or a proper driver fix in place, we'd really appreciate a
> > quick fix here.
> 
> ...doesn't sound like we are not talking about some odd corner case
> here. But in the end that would be up to Linus to decide.

The issue is that these drivers are abusing the DMA API to manage buffers
which are being transferred to trustzone. Even with the revert, this is
broken (the CPU can speculate from the kernel's cacheable linear mapping
of memory), it just appears to be less likely with the CPUs on this SoC.
So we end up in a situation where the kernel is flakey on these devices
but with even less incentive for the drivers to be fixed.

As well as broken drivers, the patch has also identified broken device-tree
files where DMA-coherent devices weher incorrectly being treated as
non-coherent:

https://lore.kernel.org/linux-arm-kernel/20221124142501.29314-1-johan+linaro@kernel.org/

so I do think it's something that's worth having as the default behaviour.

> I'll point him to this thread once more in my weekly report anyway.
> Maybe I'll even suggest to revert this change, not sure yet.

As I said above, I think the revert makes sense if the drivers are actually
being fixed, but I'm not seeing any movement at all on that front.

Will
  
Linux regression tracking (Thorsten Leemhuis) Dec. 2, 2022, 10:34 a.m. UTC | #17
On 02.12.22 11:03, Will Deacon wrote:
> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
>> On 02.12.22 09:26, Amit Pundir wrote:
>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>>>
>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
>>>>> Has any progress been made to fix this regression? It afaics is not a
>>>>> release critical issue, but well, it still would be nice to get this
>>>>> fixed before 6.1 is released.
>>>>
>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
>>>> makes it less likely to happen.
>>>>
>>>> I like the original commit removing the cache invalidation as it shows
>>>> drivers not behaving properly
>>
>> Yeah, I understand that, but I guess it's my job to ask at this point:
>> "is continuing to live with the old behavior for one or two more cycles"
>> that much of a problem"?
> 
> That wouldn't be a problem. The problem is that I haven't see any efforts
> from the Qualcomm side to actually fix the drivers [...]

Thx for sharing the details. I can fully understand your pain. But well,
in the end it looks to me like this commit it intentionally breaking
something that used to work -- which to my understanding of the "no
regression rule" is not okay, even if things only worked by chance and
not flawless.

But well, as with every rule there are misunderstandings, grey areas,
and situations where judgement calls have to be made. Then it's up to
Linus to decide how to handle things. Hence I'll just point him to this
thread and then he can decide. No biggie. And sorry if I'm being a PITA
here, I just thing doing that is my duty as regression tracker in
situations like this. Hope your won't mind.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
  
Manivannan Sadhasivam Dec. 2, 2022, 10:54 a.m. UTC | #18
On Fri, Dec 02, 2022 at 10:03:58AM +0000, Will Deacon wrote:
> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> > On 02.12.22 09:26, Amit Pundir wrote:
> > > On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > >>
> > >> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > >>> Has any progress been made to fix this regression? It afaics is not a
> > >>> release critical issue, but well, it still would be nice to get this
> > >>> fixed before 6.1 is released.
> > >>
> > >> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> > >> that exposed the driver bug. It doesn't fix the actual bug, it only
> > >> makes it less likely to happen.
> > >>
> > >> I like the original commit removing the cache invalidation as it shows
> > >> drivers not behaving properly
> > 
> > Yeah, I understand that, but I guess it's my job to ask at this point:
> > "is continuing to live with the old behavior for one or two more cycles"
> > that much of a problem"?
> 
> That wouldn't be a problem. The problem is that I haven't see any efforts
> from the Qualcomm side to actually fix the drivers so what makes you think
> the issue will be addressed in one or two more cycles? On the other hand, if
> there were patches out there trying to fix the drivers and we just needed to
> revert this change to buy them some time, then that would obviously be the
> right thing to do.
> 

There are efforts going on to fix the driver from Qualcomm. It's just that the
patches are not available yet. The delay is mainly due to the internal
communication that should happen between the internal teams.

The fix would be use a separate no-map carveout for the usecase.

But it'd be good to revert this patch untill those patches get merged.

Thanks,
Mani

> > >> but, as a workaround, we could add a
> > >> command line option to force back the old behaviour (defaulting to the
> > >> new one) until the driver is fixed.
> > 
> > Well, sometimes that approach is fine to fix a regression, but I'm not
> > sure this is one of those situations, as this...
> > 
> > > We use DB845c extensively for mainline and android-mainline[1] testing
> > > with AOSP, and it is broken for weeks now. So be it a temporary
> > > workaround or a proper driver fix in place, we'd really appreciate a
> > > quick fix here.
> > 
> > ...doesn't sound like we are not talking about some odd corner case
> > here. But in the end that would be up to Linus to decide.
> 
> The issue is that these drivers are abusing the DMA API to manage buffers
> which are being transferred to trustzone. Even with the revert, this is
> broken (the CPU can speculate from the kernel's cacheable linear mapping
> of memory), it just appears to be less likely with the CPUs on this SoC.
> So we end up in a situation where the kernel is flakey on these devices
> but with even less incentive for the drivers to be fixed.
> 
> As well as broken drivers, the patch has also identified broken device-tree
> files where DMA-coherent devices weher incorrectly being treated as
> non-coherent:
> 
> https://lore.kernel.org/linux-arm-kernel/20221124142501.29314-1-johan+linaro@kernel.org/
> 
> so I do think it's something that's worth having as the default behaviour.
> 
> > I'll point him to this thread once more in my weekly report anyway.
> > Maybe I'll even suggest to revert this change, not sure yet.
> 
> As I said above, I think the revert makes sense if the drivers are actually
> being fixed, but I'm not seeing any movement at all on that front.
> 
> Will
  
Greg KH Dec. 2, 2022, 4:10 p.m. UTC | #19
On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> On 02.12.22 11:03, Will Deacon wrote:
> > On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> >> On 02.12.22 09:26, Amit Pundir wrote:
> >>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>>>
> >>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> >>>>> Has any progress been made to fix this regression? It afaics is not a
> >>>>> release critical issue, but well, it still would be nice to get this
> >>>>> fixed before 6.1 is released.
> >>>>
> >>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> >>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> >>>> makes it less likely to happen.
> >>>>
> >>>> I like the original commit removing the cache invalidation as it shows
> >>>> drivers not behaving properly
> >>
> >> Yeah, I understand that, but I guess it's my job to ask at this point:
> >> "is continuing to live with the old behavior for one or two more cycles"
> >> that much of a problem"?
> > 
> > That wouldn't be a problem. The problem is that I haven't see any efforts
> > from the Qualcomm side to actually fix the drivers [...]
> 
> Thx for sharing the details. I can fully understand your pain. But well,
> in the end it looks to me like this commit it intentionally breaking
> something that used to work -- which to my understanding of the "no
> regression rule" is not okay, even if things only worked by chance and
> not flawless.

"no regressions" for userspace code, this is broken, out-of-tree driver
code, right?  I do not think any in-kernel drivers have this issue today
from what I can tell, but if I am wrong here, please let me know.

We don't keep stable apis, or even functionality, for out-of-tree kernel
code as that would be impossible for us to do for obvious reasons.

thanks,

greg kh
  
Linux regression tracking (Thorsten Leemhuis) Dec. 2, 2022, 4:27 p.m. UTC | #20
On 02.12.22 17:10, Greg KH wrote:
> On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
>> On 02.12.22 11:03, Will Deacon wrote:
>>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
>>>> On 02.12.22 09:26, Amit Pundir wrote:
>>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>>>>>
>>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
>>>>>>> Has any progress been made to fix this regression? It afaics is not a
>>>>>>> release critical issue, but well, it still would be nice to get this
>>>>>>> fixed before 6.1 is released.
>>>>>>
>>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
>>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
>>>>>> makes it less likely to happen.
>>>>>>
>>>>>> I like the original commit removing the cache invalidation as it shows
>>>>>> drivers not behaving properly
>>>>
>>>> Yeah, I understand that, but I guess it's my job to ask at this point:
>>>> "is continuing to live with the old behavior for one or two more cycles"
>>>> that much of a problem"?
>>>
>>> That wouldn't be a problem. The problem is that I haven't see any efforts
>>> from the Qualcomm side to actually fix the drivers [...]
>>
>> Thx for sharing the details. I can fully understand your pain. But well,
>> in the end it looks to me like this commit it intentionally breaking
>> something that used to work -- which to my understanding of the "no
>> regression rule" is not okay, even if things only worked by chance and
>> not flawless.
> 
> "no regressions" for userspace code, this is broken, out-of-tree driver
> code, right?

If so: apologies. But that's not the impression I got, as Amit wrote "I
can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
drivers." here:
https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/

>  I do not think any in-kernel drivers have this issue today
> from what I can tell, but if I am wrong here, please let me know.
> 
> We don't keep stable apis, or even functionality, for out-of-tree kernel
> code as that would be impossible for us to do for obvious reasons.

Ciao, Thorsten
  
Greg KH Dec. 2, 2022, 4:32 p.m. UTC | #21
On Fri, Dec 02, 2022 at 05:27:24PM +0100, Thorsten Leemhuis wrote:
> 
> 
> On 02.12.22 17:10, Greg KH wrote:
> > On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> >> On 02.12.22 11:03, Will Deacon wrote:
> >>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> >>>> On 02.12.22 09:26, Amit Pundir wrote:
> >>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >>>>>>
> >>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> >>>>>>> Has any progress been made to fix this regression? It afaics is not a
> >>>>>>> release critical issue, but well, it still would be nice to get this
> >>>>>>> fixed before 6.1 is released.
> >>>>>>
> >>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> >>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> >>>>>> makes it less likely to happen.
> >>>>>>
> >>>>>> I like the original commit removing the cache invalidation as it shows
> >>>>>> drivers not behaving properly
> >>>>
> >>>> Yeah, I understand that, but I guess it's my job to ask at this point:
> >>>> "is continuing to live with the old behavior for one or two more cycles"
> >>>> that much of a problem"?
> >>>
> >>> That wouldn't be a problem. The problem is that I haven't see any efforts
> >>> from the Qualcomm side to actually fix the drivers [...]
> >>
> >> Thx for sharing the details. I can fully understand your pain. But well,
> >> in the end it looks to me like this commit it intentionally breaking
> >> something that used to work -- which to my understanding of the "no
> >> regression rule" is not okay, even if things only worked by chance and
> >> not flawless.
> > 
> > "no regressions" for userspace code, this is broken, out-of-tree driver
> > code, right?
> 
> If so: apologies. But that's not the impression I got, as Amit wrote "I
> can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
> drivers." here:
> https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/

Ah, I missed that.

Ok, what in-tree drivers are having problems being buggy?  I can't seem
to figure that out from that report at all.  Does anyone know?

thanks,

greg k-h
  
Manivannan Sadhasivam Dec. 2, 2022, 5:14 p.m. UTC | #22
On Fri, Dec 02, 2022 at 05:32:51PM +0100, Greg KH wrote:
> On Fri, Dec 02, 2022 at 05:27:24PM +0100, Thorsten Leemhuis wrote:
> > 
> > 
> > On 02.12.22 17:10, Greg KH wrote:
> > > On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> > >> On 02.12.22 11:03, Will Deacon wrote:
> > >>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> > >>>> On 02.12.22 09:26, Amit Pundir wrote:
> > >>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > >>>>>>
> > >>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > >>>>>>> Has any progress been made to fix this regression? It afaics is not a
> > >>>>>>> release critical issue, but well, it still would be nice to get this
> > >>>>>>> fixed before 6.1 is released.
> > >>>>>>
> > >>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> > >>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> > >>>>>> makes it less likely to happen.
> > >>>>>>
> > >>>>>> I like the original commit removing the cache invalidation as it shows
> > >>>>>> drivers not behaving properly
> > >>>>
> > >>>> Yeah, I understand that, but I guess it's my job to ask at this point:
> > >>>> "is continuing to live with the old behavior for one or two more cycles"
> > >>>> that much of a problem"?
> > >>>
> > >>> That wouldn't be a problem. The problem is that I haven't see any efforts
> > >>> from the Qualcomm side to actually fix the drivers [...]
> > >>
> > >> Thx for sharing the details. I can fully understand your pain. But well,
> > >> in the end it looks to me like this commit it intentionally breaking
> > >> something that used to work -- which to my understanding of the "no
> > >> regression rule" is not okay, even if things only worked by chance and
> > >> not flawless.
> > > 
> > > "no regressions" for userspace code, this is broken, out-of-tree driver
> > > code, right?
> > 
> > If so: apologies. But that's not the impression I got, as Amit wrote "I
> > can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
> > drivers." here:
> > https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/
> 
> Ah, I missed that.
> 
> Ok, what in-tree drivers are having problems being buggy?  I can't seem
> to figure that out from that report at all.  Does anyone know?
> 

It is the Qualcomm Q6V5_MSS remoteproc driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c

Qualcomm is working on the fix but the patches are not ready yet. So if we can
get this patch reverted in the meantime, that would be helpful.

Thanks,
Mani

> thanks,
> 
> greg k-h
  
Will Deacon Dec. 5, 2022, 2:24 p.m. UTC | #23
On Fri, Dec 02, 2022 at 10:44:37PM +0530, Manivannan Sadhasivam wrote:
> On Fri, Dec 02, 2022 at 05:32:51PM +0100, Greg KH wrote:
> > On Fri, Dec 02, 2022 at 05:27:24PM +0100, Thorsten Leemhuis wrote:
> > > On 02.12.22 17:10, Greg KH wrote:
> > > > On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> > > >> On 02.12.22 11:03, Will Deacon wrote:
> > > >>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> > > >>>> On 02.12.22 09:26, Amit Pundir wrote:
> > > >>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > >>>>>>
> > > >>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > > >>>>>>> Has any progress been made to fix this regression? It afaics is not a
> > > >>>>>>> release critical issue, but well, it still would be nice to get this
> > > >>>>>>> fixed before 6.1 is released.
> > > >>>>>>
> > > >>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> > > >>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> > > >>>>>> makes it less likely to happen.
> > > >>>>>>
> > > >>>>>> I like the original commit removing the cache invalidation as it shows
> > > >>>>>> drivers not behaving properly
> > > >>>>
> > > >>>> Yeah, I understand that, but I guess it's my job to ask at this point:
> > > >>>> "is continuing to live with the old behavior for one or two more cycles"
> > > >>>> that much of a problem"?
> > > >>>
> > > >>> That wouldn't be a problem. The problem is that I haven't see any efforts
> > > >>> from the Qualcomm side to actually fix the drivers [...]
> > > >>
> > > >> Thx for sharing the details. I can fully understand your pain. But well,
> > > >> in the end it looks to me like this commit it intentionally breaking
> > > >> something that used to work -- which to my understanding of the "no
> > > >> regression rule" is not okay, even if things only worked by chance and
> > > >> not flawless.
> > > > 
> > > > "no regressions" for userspace code, this is broken, out-of-tree driver
> > > > code, right?
> > > 
> > > If so: apologies. But that's not the impression I got, as Amit wrote "I
> > > can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
> > > drivers." here:
> > > https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/
> > 
> > Ah, I missed that.
> > 
> > Ok, what in-tree drivers are having problems being buggy?  I can't seem
> > to figure that out from that report at all.  Does anyone know?
> > 
> 
> It is the Qualcomm Q6V5_MSS remoteproc driver:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c
> 
> Qualcomm is working on the fix but the patches are not ready yet. So if we can
> get this patch reverted in the meantime, that would be helpful.

It's good to hear that you're working to fix this, even if it's happening
behind closed doors. Do you have a rough idea how soon you'll be able to
post the remoteproc driver fixes? That would help us to figure out when
to bring back the change if we were to revert it.

Cheers,

Will
  
Manivannan Sadhasivam Dec. 6, 2022, 9:21 a.m. UTC | #24
On Mon, Dec 05, 2022 at 02:24:03PM +0000, Will Deacon wrote:
> On Fri, Dec 02, 2022 at 10:44:37PM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Dec 02, 2022 at 05:32:51PM +0100, Greg KH wrote:
> > > On Fri, Dec 02, 2022 at 05:27:24PM +0100, Thorsten Leemhuis wrote:
> > > > On 02.12.22 17:10, Greg KH wrote:
> > > > > On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> > > > >> On 02.12.22 11:03, Will Deacon wrote:
> > > > >>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> > > > >>>> On 02.12.22 09:26, Amit Pundir wrote:
> > > > >>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > >>>>>>
> > > > >>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > > > >>>>>>> Has any progress been made to fix this regression? It afaics is not a
> > > > >>>>>>> release critical issue, but well, it still would be nice to get this
> > > > >>>>>>> fixed before 6.1 is released.
> > > > >>>>>>
> > > > >>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> > > > >>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> > > > >>>>>> makes it less likely to happen.
> > > > >>>>>>
> > > > >>>>>> I like the original commit removing the cache invalidation as it shows
> > > > >>>>>> drivers not behaving properly
> > > > >>>>
> > > > >>>> Yeah, I understand that, but I guess it's my job to ask at this point:
> > > > >>>> "is continuing to live with the old behavior for one or two more cycles"
> > > > >>>> that much of a problem"?
> > > > >>>
> > > > >>> That wouldn't be a problem. The problem is that I haven't see any efforts
> > > > >>> from the Qualcomm side to actually fix the drivers [...]
> > > > >>
> > > > >> Thx for sharing the details. I can fully understand your pain. But well,
> > > > >> in the end it looks to me like this commit it intentionally breaking
> > > > >> something that used to work -- which to my understanding of the "no
> > > > >> regression rule" is not okay, even if things only worked by chance and
> > > > >> not flawless.
> > > > > 
> > > > > "no regressions" for userspace code, this is broken, out-of-tree driver
> > > > > code, right?
> > > > 
> > > > If so: apologies. But that's not the impression I got, as Amit wrote "I
> > > > can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
> > > > drivers." here:
> > > > https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/
> > > 
> > > Ah, I missed that.
> > > 
> > > Ok, what in-tree drivers are having problems being buggy?  I can't seem
> > > to figure that out from that report at all.  Does anyone know?
> > > 
> > 
> > It is the Qualcomm Q6V5_MSS remoteproc driver:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c
> > 
> > Qualcomm is working on the fix but the patches are not ready yet. So if we can
> > get this patch reverted in the meantime, that would be helpful.
> 
> It's good to hear that you're working to fix this, even if it's happening
> behind closed doors. Do you have a rough idea how soon you'll be able to
> post the remoteproc driver fixes? That would help us to figure out when
> to bring back the change if we were to revert it.
> 

Sibi is the one working on the fix. I believe he should be able to post the
patches within this week.

Thanks,
Mani

> Cheers,
> 
> Will
  
Will Deacon Dec. 6, 2022, 9:58 a.m. UTC | #25
On Tue, Dec 06, 2022 at 02:51:52PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Dec 05, 2022 at 02:24:03PM +0000, Will Deacon wrote:
> > On Fri, Dec 02, 2022 at 10:44:37PM +0530, Manivannan Sadhasivam wrote:
> > > On Fri, Dec 02, 2022 at 05:32:51PM +0100, Greg KH wrote:
> > > > On Fri, Dec 02, 2022 at 05:27:24PM +0100, Thorsten Leemhuis wrote:
> > > > > On 02.12.22 17:10, Greg KH wrote:
> > > > > > On Fri, Dec 02, 2022 at 11:34:30AM +0100, Thorsten Leemhuis wrote:
> > > > > >> On 02.12.22 11:03, Will Deacon wrote:
> > > > > >>> On Fri, Dec 02, 2022 at 09:54:05AM +0100, Thorsten Leemhuis wrote:
> > > > > >>>> On 02.12.22 09:26, Amit Pundir wrote:
> > > > > >>>>> On Thu, 1 Dec 2022 at 23:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Dec 01, 2022 at 10:29:39AM +0100, Thorsten Leemhuis wrote:
> > > > > >>>>>>> Has any progress been made to fix this regression? It afaics is not a
> > > > > >>>>>>> release critical issue, but well, it still would be nice to get this
> > > > > >>>>>>> fixed before 6.1 is released.
> > > > > >>>>>>
> > > > > >>>>>> The only (nearly) risk-free "fix" for 6.1 would be to revert the commit
> > > > > >>>>>> that exposed the driver bug. It doesn't fix the actual bug, it only
> > > > > >>>>>> makes it less likely to happen.
> > > > > >>>>>>
> > > > > >>>>>> I like the original commit removing the cache invalidation as it shows
> > > > > >>>>>> drivers not behaving properly
> > > > > >>>>
> > > > > >>>> Yeah, I understand that, but I guess it's my job to ask at this point:
> > > > > >>>> "is continuing to live with the old behavior for one or two more cycles"
> > > > > >>>> that much of a problem"?
> > > > > >>>
> > > > > >>> That wouldn't be a problem. The problem is that I haven't see any efforts
> > > > > >>> from the Qualcomm side to actually fix the drivers [...]
> > > > > >>
> > > > > >> Thx for sharing the details. I can fully understand your pain. But well,
> > > > > >> in the end it looks to me like this commit it intentionally breaking
> > > > > >> something that used to work -- which to my understanding of the "no
> > > > > >> regression rule" is not okay, even if things only worked by chance and
> > > > > >> not flawless.
> > > > > > 
> > > > > > "no regressions" for userspace code, this is broken, out-of-tree driver
> > > > > > code, right?
> > > > > 
> > > > > If so: apologies. But that's not the impression I got, as Amit wrote "I
> > > > > can reproduce this crash on vanilla v6.1-rc1 as well with no out-of-tree
> > > > > drivers." here:
> > > > > https://lore.kernel.org/linux-arm-kernel/CAMi1Hd3H2k1J8hJ6e-Miy5+nVDNzv6qQ3nN-9929B0GbHJkXEg@mail.gmail.com/
> > > > 
> > > > Ah, I missed that.
> > > > 
> > > > Ok, what in-tree drivers are having problems being buggy?  I can't seem
> > > > to figure that out from that report at all.  Does anyone know?
> > > > 
> > > 
> > > It is the Qualcomm Q6V5_MSS remoteproc driver:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/remoteproc/qcom_q6v5_mss.c
> > > 
> > > Qualcomm is working on the fix but the patches are not ready yet. So if we can
> > > get this patch reverted in the meantime, that would be helpful.
> > 
> > It's good to hear that you're working to fix this, even if it's happening
> > behind closed doors. Do you have a rough idea how soon you'll be able to
> > post the remoteproc driver fixes? That would help us to figure out when
> > to bring back the change if we were to revert it.
> > 
> 
> Sibi is the one working on the fix. I believe he should be able to post the
> patches within this week.

Oh nice, that's a lot sooner than I expected! I'll send a revert out now,
with a comment about where we're at.

Will
  
Leonard Lausen Dec. 8, 2022, 4:59 a.m. UTC | #26
Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> writes:
> This reverts commit c44094eee32f32f175aadc0efcac449d99b1bbf7.
>
> As reported by Amit [1], dropping cache invalidation from
> arch_dma_prep_coherent() triggers a crash on the Qualcomm SM8250 platform
> (most probably on other Qcom platforms too).

On sc7180 with c44094ee applied, it does not trigger crash but makes
Wifi dysfunctional by preventing initialization of ath10k_snoc.

qcom-q6v5-mss 4080000.remoteproc: PBL returned unexpected status -284098560

With the revert of c44094ee, wifi works fine again.

Thank you
Leonard
  

Patch

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 3cb101e8cb29..7d7e9a046305 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -36,7 +36,7 @@  void arch_dma_prep_coherent(struct page *page, size_t size)
 {
 	unsigned long start = (unsigned long)page_address(page);
 
-	dcache_clean_poc(start, start + size);
+	dcache_clean_inval_poc(start, start + size);
 }
 
 #ifdef CONFIG_IOMMU_DMA