[15/21] ARM: dma-mapping: always invalidate WT caches before DMA

Message ID 20230327121317.4081816-16-arnd@kernel.org
State New
Headers
Series dma-mapping: unify support for cache flushes |

Commit Message

Arnd Bergmann March 27, 2023, 12:13 p.m. UTC
  From: Arnd Bergmann <arnd@arndb.de>

Most ARM CPUs can have write-back caches and that require
cache management to be done in the dma_sync_*_for_device()
operation. This is typically done in both writeback and
writethrough mode.

The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
(arm920t, arm940t) implementations are the exception here,
and only do the cache management after the DMA is complete,
in the dma_sync_*_for_cpu() operation.

Change this for consistency with the other platforms. This
should have no user visible effect.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/arm/mm/cache-v4.S   | 8 ++++----
 arch/arm/mm/cache-v4wt.S | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)
  

Comments

Linus Walleij March 31, 2023, 9:01 a.m. UTC | #1
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> Most ARM CPUs can have write-back caches and that require
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
>
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
>
> Change this for consistency with the other platforms. This
> should have no user visible effect.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Looks good to me.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
  
Russell King (Oracle) March 31, 2023, 9:07 a.m. UTC | #2
On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> Most ARM CPUs can have write-back caches and that require
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
> 
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
> 
> Change this for consistency with the other platforms. This
> should have no user visible effect.

NAK...

The reason we do cache management _after_ is to ensure that there
is no stale data. The kernel _has_ (at the very least in the past)
performed DMA to data structures that are embedded within other
data structures, resulting in cache lines being shared. If one of
those cache lines is touched while DMA is progressing, then we
must to cache management _after_ the DMA operation has completed.
Doing it before is no good.
  
Russell King (Oracle) March 31, 2023, 9:35 a.m. UTC | #3
On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> > 
> > Most ARM CPUs can have write-back caches and that require
> > cache management to be done in the dma_sync_*_for_device()
> > operation. This is typically done in both writeback and
> > writethrough mode.
> > 
> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> > (arm920t, arm940t) implementations are the exception here,
> > and only do the cache management after the DMA is complete,
> > in the dma_sync_*_for_cpu() operation.
> > 
> > Change this for consistency with the other platforms. This
> > should have no user visible effect.
> 
> NAK...
> 
> The reason we do cache management _after_ is to ensure that there
> is no stale data. The kernel _has_ (at the very least in the past)
> performed DMA to data structures that are embedded within other
> data structures, resulting in cache lines being shared. If one of
> those cache lines is touched while DMA is progressing, then we
> must to cache management _after_ the DMA operation has completed.
> Doing it before is no good.

It looks like the main offender of "touching cache lines shared
with DMA" has now been resolved - that was the SCSI sense buffer,
and was fixed some time ago:

commit de25deb18016f66dcdede165d07654559bb332bc
Author: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Date:   Wed Jan 16 13:32:17 2008 +0900

/if/ that is the one and only case, then we're probably fine, but
having been through an era where this kind of thing was the norm
and requests to fix it did not get great responses from subsystem
maintainers, I just don't trust the kernel not to want to DMA to
overlapping cache lines.
  
Arnd Bergmann March 31, 2023, 10:38 a.m. UTC | #4
On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> > From: Arnd Bergmann <arnd@arndb.de>
>> > 
>> > Most ARM CPUs can have write-back caches and that require
>> > cache management to be done in the dma_sync_*_for_device()
>> > operation. This is typically done in both writeback and
>> > writethrough mode.
>> > 
>> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> > (arm920t, arm940t) implementations are the exception here,
>> > and only do the cache management after the DMA is complete,
>> > in the dma_sync_*_for_cpu() operation.
>> > 
>> > Change this for consistency with the other platforms. This
>> > should have no user visible effect.
>> 
>> NAK...
>> 
>> The reason we do cache management _after_ is to ensure that there
>> is no stale data. The kernel _has_ (at the very least in the past)
>> performed DMA to data structures that are embedded within other
>> data structures, resulting in cache lines being shared. If one of
>> those cache lines is touched while DMA is progressing, then we
>> must to cache management _after_ the DMA operation has completed.
>> Doing it before is no good.

What I'm trying to address here is the inconsistency between
implementations. If we decide that we always want to invalidate
after FROM_DEVICE, I can do that as part of the series, but then
I have to change most of the other arm implementations.

Right now, the only WT cache implementations that do the the
invalidation after the DMA are cache-v4.S (arm720 integrator and
clps711x), cache-v4wt.S (arm920/arm922 at91rm9200, clps711x,
ep93xx, omap15xx, imx1 and integrator), some sparc32 leon3 and
early xtensa.

Most architectures that have write-through caches (m68k,
microblaze) or write-back caches but no speculation (all other
armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa)
only invalidate before DMA but not after.

OTOH, most machines that are actually in use today (armv6+,
powerpc, later mips, microblaze, riscv, nios2) also have to
deal with speculative accesses, so they end up having to
invalidate or flush both before and after a DMA_FROM_DEVICE
and DMA_BIDIRECTIONAL.

> It looks like the main offender of "touching cache lines shared
> with DMA" has now been resolved - that was the SCSI sense buffer,
> and was fixed some time ago:
>
> commit de25deb18016f66dcdede165d07654559bb332bc
> Author: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Date:   Wed Jan 16 13:32:17 2008 +0900
>
> /if/ that is the one and only case, then we're probably fine, but
> having been through an era where this kind of thing was the norm
> and requests to fix it did not get great responses from subsystem
> maintainers, I just don't trust the kernel not to want to DMA to
> overlapping cache lines.

Thanks for digging that out, that is very useful. It looks like this
was around the same time as 03d70617b8a7 ("powerpc: Prevent memory
corruption due to cache invalidation of unaligned DMA buffer"), so
it may well have been related. I know we also had more recent 
problems with USB drivers trying to DMA to stack, which would 
also cause problems on non-coherent machines, but some of these were
only found after we introduced VMAP_STACK.

It would be nice to use KASAN prevent reads on cache lines that
have in-flight DMA.

     Arnd
  
Russell King (Oracle) March 31, 2023, 11:08 a.m. UTC | #5
On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> >> > From: Arnd Bergmann <arnd@arndb.de>
> >> > 
> >> > Most ARM CPUs can have write-back caches and that require
> >> > cache management to be done in the dma_sync_*_for_device()
> >> > operation. This is typically done in both writeback and
> >> > writethrough mode.
> >> > 
> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> >> > (arm920t, arm940t) implementations are the exception here,
> >> > and only do the cache management after the DMA is complete,
> >> > in the dma_sync_*_for_cpu() operation.
> >> > 
> >> > Change this for consistency with the other platforms. This
> >> > should have no user visible effect.
> >> 
> >> NAK...
> >> 
> >> The reason we do cache management _after_ is to ensure that there
> >> is no stale data. The kernel _has_ (at the very least in the past)
> >> performed DMA to data structures that are embedded within other
> >> data structures, resulting in cache lines being shared. If one of
> >> those cache lines is touched while DMA is progressing, then we
> >> must to cache management _after_ the DMA operation has completed.
> >> Doing it before is no good.
> 
> What I'm trying to address here is the inconsistency between
> implementations. If we decide that we always want to invalidate
> after FROM_DEVICE, I can do that as part of the series, but then
> I have to change most of the other arm implementations.

Why?

First thing to say is that DMA to buffers where the cache lines are
shared with data the CPU may be accessing need to be outlawed - they
are a recipe for data corruption - always have been. Sadly, some folk
don't see it that way because of a passed "x86 just works and we demand
that all architectures behave like x86!" attitude. The SCSI sense
buffer has historically been a big culpret for that.


For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
because we want to ensure that the DMA'd data is properly readable upon
completion of the DMA. If overlapping cache lines have been touched
while DMA is progressing, and we invalidate before DMA, then the cache
will contain stale data that will remain in the cache after DMA has
completed. Invalidating a WT cache does not destroy any data, so is
safe to do. So the safest approach is to invalidate after DMA has
completed in this instance.


For WB, FROM_DEVICE, we have the problem of dirty cache lines which
we have to get rid of. For the overlapping cache lines, we have to
clean those before DMA begins to ensure that data written to the
non-DMA-buffer part is preserved. All other cache lines need to be
invalidated before DMA begins to ensure that writebacks do not
corrupt data from the device. Hence why it's different.


And hence why the ARM implementation is based around buffer ownership.
And hence why they're called dma_map_area()/dma_unmap_area() rather
than the cache operations themselves. This is an intentional change,
one that was done when ARMv6 came along.

> OTOH, most machines that are actually in use today (armv6+,
> powerpc, later mips, microblaze, riscv, nios2) also have to
> deal with speculative accesses, so they end up having to
> invalidate or flush both before and after a DMA_FROM_DEVICE
> and DMA_BIDIRECTIONAL.

Again, these are implementation details of the cache, and this is
precisely why having the map/unmap interface is so much better than
having generic code explicitly call "clean" and "invalidate"
interfaces into arch code.

If we treat everything as a speculative cache, then we're doing
needless extra work for those caches that aren't speculative. So,
ARM would have to step through every cache line for every DMA
buffer at 32-byte intervals performing cache maintenance whether
the cache is speculative or not. That is expensive, and hurts
performance.

I put a lot of thought into this when I updated the ARM DMA
implementation when we started seeing these different cache types
particularly when ARMv6 came along. I really don't want that work
wrecked.
  
Arnd Bergmann March 31, 2023, 12:32 p.m. UTC | #6
On Fri, Mar 31, 2023, at 13:08, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
>> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
>> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> >> > From: Arnd Bergmann <arnd@arndb.de>
>> >> > 
>> >> > Most ARM CPUs can have write-back caches and that require
>> >> > cache management to be done in the dma_sync_*_for_device()
>> >> > operation. This is typically done in both writeback and
>> >> > writethrough mode.
>> >> > 
>> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> >> > (arm920t, arm940t) implementations are the exception here,
>> >> > and only do the cache management after the DMA is complete,
>> >> > in the dma_sync_*_for_cpu() operation.
>> >> > 
>> >> > Change this for consistency with the other platforms. This
>> >> > should have no user visible effect.
>> >> 
>> >> NAK...So t
>> >> 
>> >> The reason we do cache management _after_ is to ensure that there
>> >> is no stale data. The kernel _has_ (at the very least in the past)
>> >> performed DMA to data structures that are embedded within other
>> >> data structures, resulting in cache lines being shared. If one of
>> >> those cache lines is touched while DMA is progressing, then we
>> >> must to cache management _after_ the DMA operation has completed.
>> >> Doing it before is no good.
>> 
>> What I'm trying to address here is the inconsistency between
>> implementations. If we decide that we always want to invalidate
>> after FROM_DEVICE, I can do that as part of the series, but then
>> I have to change most of the other arm implementations.
>
> Why?
>
> First thing to say is that DMA to buffers where the cache lines are
> shared with data the CPU may be accessing need to be outlawed - they
> are a recipe for data corruption - always have been. Sadly, some folk
> don't see it that way because of a passed "x86 just works and we demand
> that all architectures behave like x86!" attitude. The SCSI sense
> buffer has historically been a big culpret for that.

I think that part is pretty much agree by everyone, the difference
between architectures is to what extend they try to work around
drivers that get it wrong.

> For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
> because we want to ensure that the DMA'd data is properly readable upon
> completion of the DMA. If overlapping cache lines haveDoes that mean you take back you NAK on this patch tehn? been touched
> while DMA is proSo tgressing, and we invalidate before DMA, then the cache
> will contain stale data that will remain in the cache after DMA has
> completed. Invalidating a WT cache does not destroy any data, so is
> safe to do. So the safest approach is to invalidate after DMA has
> completed in this instance.

> For WB, FROM_DEVICE, we have the problem of dirty cache lines which
> we have to get rid of. For the overlapping cache lines, we have to
> clean those before DMA begins to ensure that data written to the
> non-DMA-buffer part is preserved. All other cache lines need to be
> invalidated before DMA begins to ensure that writebacks do not
> corrupt data from the device. Hence why it's different.

I don't see how WB and Wt caches being different implies that we
should give extra guarantees to (broken) drivers when WT caches on
other architectures. Always doing it first in the absence of
prefetching avoids a special case in the generic implementation
and makes the driver interface on Arm/sparc32/xtensa WT caches
no different from what everything provides.

The writeback before DMA_FROM_DEVICE is another issue that we
have to address at some point, as there are clearly incompatible
expectations here. It makes no sense that a device driver can
rely on the entire to be written back on a 64-bit arm kernel
but not on a 32-bit kernel.

> And hence why the ARM implementation is based around buffer ownership.
> And hence why they're called dma_map_area()/dma_unmap_area() rather
> than the cache operations themselves. This is an intentional change,
> one that was done when ARMv6 came along.

The bit that has changed in the meantime though is that the buffer
ownership interfaces has moved up in the stack and is now handled
mostly in the common kernel/dma/*.c that multiplexes between the
direct/iommu/swiotlb dma_map_ops, except for the bit about
noncoherent devices. Right now, we have 37 implementations that
are mostly identical, and all the differences are either bugs
or disagreements about the API guarantees but not related to
architecture specific requirements.

>> OTOH, most machines that are actually in use today (armv6+,
>> powerpc, later mips, microblaze, riscv, nios2) also have to
>> deal with speculative accesses, so they end up having to
>> invalidate or flush both before and after a DMA_FROM_DEVICE
>> and DMA_BIDIRECTIONAL.
>
> Again, these are implementation details of the cache, and this is
> precisely why having the map/unmap interface is so much better than
> having generic code explicitly call "clean" and "invalidate"
> interfaces into arch code.
>
> If we treat everything as a speculative cache, then we're doing
> needless extra work for those caches that aren't speculative. So,
> ARM would have to step through every cache line for every DMA
> buffer at 32-byte intervals performing cache maintenance whether
> the cache is speculative or not. That is expensive, and hurts
> performance.

Dop that mean that you agree with this patch 15 then after all?

If you think we don't need an invalidation after DMA_FROM_DEVICE
on non-speculating CPUs, it should be fine to make the WT case
consistent with the rest.

      Arnd
  

Patch

diff --git a/arch/arm/mm/cache-v4.S b/arch/arm/mm/cache-v4.S
index 7787057e4990..e2b104876340 100644
--- a/arch/arm/mm/cache-v4.S
+++ b/arch/arm/mm/cache-v4.S
@@ -117,23 +117,23 @@  ENTRY(v4_dma_flush_range)
 	ret	lr
 
 /*
- *	dma_unmap_area(start, size, dir)
+ *	dma_map_area(start, size, dir)
  *	- start	- kernel virtual start address
  *	- size	- size of region
  *	- dir	- DMA direction
  */
-ENTRY(v4_dma_unmap_area)
+ENTRY(v4_dma_map_area)
 	teq	r2, #DMA_TO_DEVICE
 	bne	v4_dma_flush_range
 	/* FALLTHROUGH */
 
 /*
- *	dma_map_area(start, size, dir)
+ *	dma_unmap_area(start, size, dir)
  *	- start	- kernel virtual start address
  *	- size	- size of region
  *	- dir	- DMA direction
  */
-ENTRY(v4_dma_map_area)
+ENTRY(v4_dma_unmap_area)
 	ret	lr
 ENDPROC(v4_dma_unmap_area)
 ENDPROC(v4_dma_map_area)
diff --git a/arch/arm/mm/cache-v4wt.S b/arch/arm/mm/cache-v4wt.S
index 0b290c25a99d..652218752f88 100644
--- a/arch/arm/mm/cache-v4wt.S
+++ b/arch/arm/mm/cache-v4wt.S
@@ -172,24 +172,24 @@  v4wt_dma_inv_range:
 	.equ	v4wt_dma_flush_range, v4wt_dma_inv_range
 
 /*
- *	dma_unmap_area(start, size, dir)
+ *	dma_map_area(start, size, dir)
  *	- start	- kernel virtual start address
  *	- size	- size of region
  *	- dir	- DMA direction
  */
-ENTRY(v4wt_dma_unmap_area)
+ENTRY(v4wt_dma_map_area)
 	add	r1, r1, r0
 	teq	r2, #DMA_TO_DEVICE
 	bne	v4wt_dma_inv_range
 	/* FALLTHROUGH */
 
 /*
- *	dma_map_area(start, size, dir)
+ *	dma_unmap_area(start, size, dir)
  *	- start	- kernel virtual start address
  *	- size	- size of region
  *	- dir	- DMA direction
  */
-ENTRY(v4wt_dma_map_area)
+ENTRY(v4wt_dma_unmap_area)
 	ret	lr
 ENDPROC(v4wt_dma_unmap_area)
 ENDPROC(v4wt_dma_map_area)