drm: gem: add an option for supporting the dma-coherent hardware.

Message ID 20230607053053.345101-1-suijingfeng@loongson.cn
State New
Headers
Series drm: gem: add an option for supporting the dma-coherent hardware. |

Commit Message

Sui Jingfeng June 7, 2023, 5:30 a.m. UTC
  The single map_noncoherent member of struct drm_gem_dma_object may not
sufficient for describing the backing memory of the GEM buffer object.

Especially on dma-coherent systems, the backing memory is both cached
coherent for multi-core CPUs and dma-coherent for peripheral device.
Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.

Whether a peripheral device is dma-coherent or not can be
implementation-dependent. The single map_noncoherent option is not enough
to reflect real hardware anymore. For example, the Loongson LS3A4000 CPU
and LS2K2000/LS2K1000 SoC, peripheral device of such hardware platform
allways snoop CPU's cache. Doing the allocation with dma_alloc_coherent
function is preferred. The return buffer is cached, it should not using
the default write-combine mapping. While with the current implement, there
no way to tell the drm core to reflect this.

This patch adds cached and coherent members to struct drm_gem_dma_object.
which allow driver implements to inform the core. Introducing new mappings
while keeping the original default behavior unchanged.

Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
---
 drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
 drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c      | 20 ++++++++++++++++----
 drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
 drivers/gpu/drm/rcar-du/Kconfig           |  2 --
 drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
 include/drm/drm_gem_dma_helper.h          |  7 +++++--
 7 files changed, 34 insertions(+), 17 deletions(-)
  

Comments

Paul Cercueil June 7, 2023, 9:36 a.m. UTC | #1
Hi Sui,

Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> The single map_noncoherent member of struct drm_gem_dma_object may
> not
> sufficient for describing the backing memory of the GEM buffer
> object.
> 
> Especially on dma-coherent systems, the backing memory is both cached
> coherent for multi-core CPUs and dma-coherent for peripheral device.
> Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
> 
> Whether a peripheral device is dma-coherent or not can be
> implementation-dependent. The single map_noncoherent option is not
> enough
> to reflect real hardware anymore. For example, the Loongson LS3A4000
> CPU
> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> platform
> allways snoop CPU's cache. Doing the allocation with
> dma_alloc_coherent
> function is preferred. The return buffer is cached, it should not
> using
> the default write-combine mapping. While with the current implement,
> there
> no way to tell the drm core to reflect this.
> 
> This patch adds cached and coherent members to struct
> drm_gem_dma_object.
> which allow driver implements to inform the core. Introducing new
> mappings
> while keeping the original default behavior unchanged.

Did you try to simply set the "dma-coherent" property to the device's
node?

From what I understand if you add that property then Linux will use DMA
coherent memory even though you use dma_alloc_noncoherent() and the
sync_single_for_cpu() / sync_single_for_device() are then NOPs.

Cheers,
-Paul

> Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
> ---
>  drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
>  drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
>  drivers/gpu/drm/drm_gem_dma_helper.c      | 20 ++++++++++++++++----
>  drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
>  drivers/gpu/drm/rcar-du/Kconfig           |  2 --
>  drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
>  include/drm/drm_gem_dma_helper.h          |  7 +++++--
>  7 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
> b/drivers/gpu/drm/drm_fb_dma_helper.c
> index 3b535ad1b07c..93ff05041192 100644
> --- a/drivers/gpu/drm/drm_fb_dma_helper.c
> +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
> @@ -106,16 +106,15 @@ dma_addr_t drm_fb_dma_get_gem_addr(struct
> drm_framebuffer *fb,
>  EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
>  
>  /**
> - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-coherent
> backing
> - *     memory
> + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached backing
> memory
>   * @drm: DRM device
>   * @old_state: Old plane state
>   * @state: New plane state
>   *
>   * This function can be used by drivers that use damage clips and
> have
> - * DMA GEM objects backed by non-coherent memory. Calling this
> function
> - * in a plane's .atomic_update ensures that all the data in the
> backing
> - * memory have been written to RAM.
> + * DMA GEM objects backed by cached memory. Calling this function in
> a
> + * plane's .atomic_update ensures that all the data in the backing
> memory
> + * have been written to RAM.
>   */
>  void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
>                                   struct drm_plane_state *old_state,
> @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
> drm_device *drm,
>  
>         for (i = 0; i < finfo->num_planes; i++) {
>                 dma_obj = drm_fb_dma_get_gem_obj(state->fb, i);
> -               if (!dma_obj->map_noncoherent)
> +               if (dma_obj->cached && dma_obj->coherent)
>                         continue;
>  
>                 daddr = drm_fb_dma_get_gem_addr(state->fb, state, i);
> diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
> b/drivers/gpu/drm/drm_fbdev_dma.c
> index d86773fa8ab0..49fe9b284cc8 100644
> --- a/drivers/gpu/drm/drm_fbdev_dma.c
> +++ b/drivers/gpu/drm/drm_fbdev_dma.c
> @@ -131,7 +131,7 @@ static int drm_fbdev_dma_helper_fb_probe(struct
> drm_fb_helper *fb_helper,
>  
>         /* screen */
>         info->flags |= FBINFO_VIRTFB; /* system memory */
> -       if (dma_obj->map_noncoherent)
> +       if (dma_obj->cached)
>                 info->flags |= FBINFO_READS_FAST; /* signal caching
> */
>         info->screen_size = sizes->surface_height * fb->pitches[0];
>         info->screen_buffer = map.vaddr;
> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
> b/drivers/gpu/drm/drm_gem_dma_helper.c
> index 870b90b78bc4..dec1d512bdf1 100644
> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
> @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device *drm,
> size_t size, bool private)
>                 drm_gem_private_object_init(drm, gem_obj, size);
>  
>                 /* Always use writecombine for dma-buf mappings */
> -               dma_obj->map_noncoherent = false;
> +               /* FIXME: This is not always true, on some dma
> coherent system,
> +                * cached mappings should be preferred over
> writecombine
> +                */
> +               dma_obj->cached = false;
> +               dma_obj->coherent = false;
>         } else {
>                 ret = drm_gem_object_init(drm, gem_obj, size);
>         }
> @@ -143,7 +147,11 @@ struct drm_gem_dma_object
> *drm_gem_dma_create(struct drm_device *drm,
>         if (IS_ERR(dma_obj))
>                 return dma_obj;
>  
> -       if (dma_obj->map_noncoherent) {
> +       if (dma_obj->cached && dma_obj->coherent) {
> +               dma_obj->vaddr = dma_alloc_coherent(drm->dev, size,
> +                                                   &dma_obj-
> >dma_addr,
> +                                                   GFP_KERNEL |
> __GFP_NOWARN);
> +       } else if (dma_obj->cached && !dma_obj->coherent) {
>                 dma_obj->vaddr = dma_alloc_noncoherent(drm->dev,
> size,
>                                                        &dma_obj-
> >dma_addr,
>                                                        DMA_TO_DEVICE,
> @@ -153,6 +161,7 @@ struct drm_gem_dma_object
> *drm_gem_dma_create(struct drm_device *drm,
>                                               &dma_obj->dma_addr,
>                                               GFP_KERNEL |
> __GFP_NOWARN);
>         }
> +
>         if (!dma_obj->vaddr) {
>                 drm_dbg(drm, "failed to allocate buffer with size
> %zu\n",
>                          size);
> @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct drm_gem_dma_object
> *dma_obj)
>                         dma_buf_vunmap_unlocked(gem_obj-
> >import_attach->dmabuf, &map);
>                 drm_prime_gem_destroy(gem_obj, dma_obj->sgt);
>         } else if (dma_obj->vaddr) {
> -               if (dma_obj->map_noncoherent)
> +               if (dma_obj->cached && dma_obj->coherent)
> +                       dma_free_coherent(gem_obj->dev->dev, dma_obj-
> >base.size,
> +                                         dma_obj->vaddr, dma_obj-
> >dma_addr);
> +               else if (dma_obj->cached && !dma_obj->coherent)
>                         dma_free_noncoherent(gem_obj->dev->dev,
> dma_obj->base.size,
>                                              dma_obj->vaddr, dma_obj-
> >dma_addr,
>                                              DMA_TO_DEVICE);
> @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct drm_gem_dma_object
> *dma_obj, struct vm_area_struct *
>         vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
>         vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
>  
> -       if (dma_obj->map_noncoherent) {
> +       if (dma_obj->cached) {
>                 vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>  
>                 ret = dma_mmap_pages(dma_obj->base.dev->dev,
> diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> index 5ec75e9ba499..a3df2f99a757 100644
> --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct drm_device
> *drm, size_t size)
>         if (!obj)
>                 return ERR_PTR(-ENOMEM);
>  
> -       obj->map_noncoherent = priv->soc_info->map_noncoherent;
> +       if (priv->soc_info->map_noncoherent) {
> +               obj->cached = true;
> +               obj->coherent = false;
> +       }
>  
>         return &obj->base;
>  }
> diff --git a/drivers/gpu/drm/rcar-du/Kconfig b/drivers/gpu/drm/rcar-
> du/Kconfig
> index 53c356aed5d5..dddc70c08bdc 100644
> --- a/drivers/gpu/drm/rcar-du/Kconfig
> +++ b/drivers/gpu/drm/rcar-du/Kconfig
> @@ -2,8 +2,6 @@
>  config DRM_RCAR_DU
>         tristate "DRM Support for R-Car Display Unit"
>         depends on DRM && OF
> -       depends on ARM || ARM64
> -       depends on ARCH_RENESAS || COMPILE_TEST
>         select DRM_KMS_HELPER
>         select DRM_GEM_DMA_HELPER
>         select VIDEOMODE_HELPERS
> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> index adfb36b0e815..1142d51473e6 100644
> --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> @@ -386,7 +386,9 @@ struct drm_gem_object
> *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
>         gem_obj->funcs = &rcar_du_gem_funcs;
>  
>         drm_gem_private_object_init(dev, gem_obj, attach->dmabuf-
> >size);
> -       dma_obj->map_noncoherent = false;
> +
> +       dma_obj->cached = false;
> +       dma_obj->coherent = false;
>  
>         ret = drm_gem_create_mmap_offset(gem_obj);
>         if (ret) {
> diff --git a/include/drm/drm_gem_dma_helper.h
> b/include/drm/drm_gem_dma_helper.h
> index 8a043235dad8..585ce3d4d1eb 100644
> --- a/include/drm/drm_gem_dma_helper.h
> +++ b/include/drm/drm_gem_dma_helper.h
> @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
>   *       more than one entry but they are guaranteed to have
> contiguous
>   *       DMA addresses.
>   * @vaddr: kernel virtual address of the backing memory
> - * @map_noncoherent: if true, the GEM object is backed by non-
> coherent memory
> + * @cached: if true, the GEM object is backed by cached memory
> + * @coherent: This option only meaningful when a GEM object is
> cached.
> + *            If true, Sync the GEM object for DMA access is not
> required.
>   */
>  struct drm_gem_dma_object {
>         struct drm_gem_object base;
> @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
>         /* For objects with DMA memory allocated by GEM DMA */
>         void *vaddr;
>  
> -       bool map_noncoherent;
> +       bool cached;
> +       bool coherent;
>  };
>  
>  #define to_drm_gem_dma_obj(gem_obj) \
  
Sui Jingfeng June 7, 2023, 10:30 a.m. UTC | #2
Hi,


On 2023/6/7 17:36, Paul Cercueil wrote:
> Hi Sui,
>
> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>> The single map_noncoherent member of struct drm_gem_dma_object may
>> not
>> sufficient for describing the backing memory of the GEM buffer
>> object.
>>
>> Especially on dma-coherent systems, the backing memory is both cached
>> coherent for multi-core CPUs and dma-coherent for peripheral device.
>> Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
>>
>> Whether a peripheral device is dma-coherent or not can be
>> implementation-dependent. The single map_noncoherent option is not
>> enough
>> to reflect real hardware anymore. For example, the Loongson LS3A4000
>> CPU
>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>> platform
>> allways snoop CPU's cache. Doing the allocation with
>> dma_alloc_coherent
>> function is preferred. The return buffer is cached, it should not
>> using
>> the default write-combine mapping. While with the current implement,
>> there
>> no way to tell the drm core to reflect this.
>>
>> This patch adds cached and coherent members to struct
>> drm_gem_dma_object.
>> which allow driver implements to inform the core. Introducing new
>> mappings
>> while keeping the original default behavior unchanged.
> Did you try to simply set the "dma-coherent" property to the device's
> node?

But this approach can only be applied for the device driver with DT support.

X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically do not 
have DT support.

They using ACPI to pass parameter from the firmware to Linux kernel.

You approach will lost the effectiveness on such a case.

>  From what I understand if you add that property then Linux will use DMA
> coherent memory even though you use dma_alloc_noncoherent() and the
> sync_single_for_cpu() / sync_single_for_device() are then NOPs.

Please do not mitigate the problems with confusing method.


This approach not only tend to generate confusion but also 
implement-dependent

and arch-dependent. It's definitely problematic.


How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH specific 
thing.

Dependent on how does the arch_dma_ops is implemented.


The definition of the coherent on different ARCH has different meanings.

The definition of the wirte-combine on different ARCH has different 
meanings.


The wirte-combine(uncache acceleration) on mips is non dma-coherent.

But on arm, It seem that wirte-combine is coherent. (guaranteed by arch 
implement).


I also heard using dma_alloc_coherent  to allocation the buffer for the 
non-coherent doesn't hurt, but the reverse is not true.


But please do not create confusion.

software composite is faster because better cacheusing rate and

cache is faster to read.

It is faster because it is cached, not because it is non-coherent.

non-coherent is arch thing and/or driver-side thing,

it is a side effect of  using the cached mapping.


It should left to driver to handle such a side effect. The device driver

know their device, so its the device driver's responsibility to maintain

the coherency.  On loongson platform, we don't need to call 
drm_fb_dma_sync_non_coherent() function, Its already guaranteed by hardware.


> Cheers,
> -Paul
>
>> Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
>> ---
>>   drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
>>   drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
>>   drivers/gpu/drm/drm_gem_dma_helper.c      | 20 ++++++++++++++++----
>>   drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
>>   drivers/gpu/drm/rcar-du/Kconfig           |  2 --
>>   drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
>>   include/drm/drm_gem_dma_helper.h          |  7 +++++--
>>   7 files changed, 34 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
>> b/drivers/gpu/drm/drm_fb_dma_helper.c
>> index 3b535ad1b07c..93ff05041192 100644
>> --- a/drivers/gpu/drm/drm_fb_dma_helper.c
>> +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
>> @@ -106,16 +106,15 @@ dma_addr_t drm_fb_dma_get_gem_addr(struct
>> drm_framebuffer *fb,
>>   EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
>>   
>>   /**
>> - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-coherent
>> backing
>> - *     memory
>> + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached backing
>> memory
>>    * @drm: DRM device
>>    * @old_state: Old plane state
>>    * @state: New plane state
>>    *
>>    * This function can be used by drivers that use damage clips and
>> have
>> - * DMA GEM objects backed by non-coherent memory. Calling this
>> function
>> - * in a plane's .atomic_update ensures that all the data in the
>> backing
>> - * memory have been written to RAM.
>> + * DMA GEM objects backed by cached memory. Calling this function in
>> a
>> + * plane's .atomic_update ensures that all the data in the backing
>> memory
>> + * have been written to RAM.
>>    */
>>   void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
>>                                    struct drm_plane_state *old_state,
>> @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
>> drm_device *drm,
>>   
>>          for (i = 0; i < finfo->num_planes; i++) {
>>                  dma_obj = drm_fb_dma_get_gem_obj(state->fb, i);
>> -               if (!dma_obj->map_noncoherent)
>> +               if (dma_obj->cached && dma_obj->coherent)
>>                          continue;
>>   
>>                  daddr = drm_fb_dma_get_gem_addr(state->fb, state, i);
>> diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
>> b/drivers/gpu/drm/drm_fbdev_dma.c
>> index d86773fa8ab0..49fe9b284cc8 100644
>> --- a/drivers/gpu/drm/drm_fbdev_dma.c
>> +++ b/drivers/gpu/drm/drm_fbdev_dma.c
>> @@ -131,7 +131,7 @@ static int drm_fbdev_dma_helper_fb_probe(struct
>> drm_fb_helper *fb_helper,
>>   
>>          /* screen */
>>          info->flags |= FBINFO_VIRTFB; /* system memory */
>> -       if (dma_obj->map_noncoherent)
>> +       if (dma_obj->cached)
>>                  info->flags |= FBINFO_READS_FAST; /* signal caching
>> */
>>          info->screen_size = sizes->surface_height * fb->pitches[0];
>>          info->screen_buffer = map.vaddr;
>> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
>> b/drivers/gpu/drm/drm_gem_dma_helper.c
>> index 870b90b78bc4..dec1d512bdf1 100644
>> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
>> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
>> @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device *drm,
>> size_t size, bool private)
>>                  drm_gem_private_object_init(drm, gem_obj, size);
>>   
>>                  /* Always use writecombine for dma-buf mappings */
>> -               dma_obj->map_noncoherent = false;
>> +               /* FIXME: This is not always true, on some dma
>> coherent system,
>> +                * cached mappings should be preferred over
>> writecombine
>> +                */
>> +               dma_obj->cached = false;
>> +               dma_obj->coherent = false;
>>          } else {
>>                  ret = drm_gem_object_init(drm, gem_obj, size);
>>          }
>> @@ -143,7 +147,11 @@ struct drm_gem_dma_object
>> *drm_gem_dma_create(struct drm_device *drm,
>>          if (IS_ERR(dma_obj))
>>                  return dma_obj;
>>   
>> -       if (dma_obj->map_noncoherent) {
>> +       if (dma_obj->cached && dma_obj->coherent) {
>> +               dma_obj->vaddr = dma_alloc_coherent(drm->dev, size,
>> +                                                   &dma_obj-
>>> dma_addr,
>> +                                                   GFP_KERNEL |
>> __GFP_NOWARN);
>> +       } else if (dma_obj->cached && !dma_obj->coherent) {
>>                  dma_obj->vaddr = dma_alloc_noncoherent(drm->dev,
>> size,
>>                                                         &dma_obj-
>>> dma_addr,
>>                                                         DMA_TO_DEVICE,
>> @@ -153,6 +161,7 @@ struct drm_gem_dma_object
>> *drm_gem_dma_create(struct drm_device *drm,
>>                                                &dma_obj->dma_addr,
>>                                                GFP_KERNEL |
>> __GFP_NOWARN);
>>          }
>> +
>>          if (!dma_obj->vaddr) {
>>                  drm_dbg(drm, "failed to allocate buffer with size
>> %zu\n",
>>                           size);
>> @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct drm_gem_dma_object
>> *dma_obj)
>>                          dma_buf_vunmap_unlocked(gem_obj-
>>> import_attach->dmabuf, &map);
>>                  drm_prime_gem_destroy(gem_obj, dma_obj->sgt);
>>          } else if (dma_obj->vaddr) {
>> -               if (dma_obj->map_noncoherent)
>> +               if (dma_obj->cached && dma_obj->coherent)
>> +                       dma_free_coherent(gem_obj->dev->dev, dma_obj-
>>> base.size,
>> +                                         dma_obj->vaddr, dma_obj-
>>> dma_addr);
>> +               else if (dma_obj->cached && !dma_obj->coherent)
>>                          dma_free_noncoherent(gem_obj->dev->dev,
>> dma_obj->base.size,
>>                                               dma_obj->vaddr, dma_obj-
>>> dma_addr,
>>                                               DMA_TO_DEVICE);
>> @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct drm_gem_dma_object
>> *dma_obj, struct vm_area_struct *
>>          vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
>>          vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
>>   
>> -       if (dma_obj->map_noncoherent) {
>> +       if (dma_obj->cached) {
>>                  vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>>   
>>                  ret = dma_mmap_pages(dma_obj->base.dev->dev,
>> diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>> b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>> index 5ec75e9ba499..a3df2f99a757 100644
>> --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>> +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>> @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct drm_device
>> *drm, size_t size)
>>          if (!obj)
>>                  return ERR_PTR(-ENOMEM);
>>   
>> -       obj->map_noncoherent = priv->soc_info->map_noncoherent;
>> +       if (priv->soc_info->map_noncoherent) {
>> +               obj->cached = true;
>> +               obj->coherent = false;
>> +       }
>>   
>>          return &obj->base;
>>   }
>> diff --git a/drivers/gpu/drm/rcar-du/Kconfig b/drivers/gpu/drm/rcar-
>> du/Kconfig
>> index 53c356aed5d5..dddc70c08bdc 100644
>> --- a/drivers/gpu/drm/rcar-du/Kconfig
>> +++ b/drivers/gpu/drm/rcar-du/Kconfig
>> @@ -2,8 +2,6 @@
>>   config DRM_RCAR_DU
>>          tristate "DRM Support for R-Car Display Unit"
>>          depends on DRM && OF
>> -       depends on ARM || ARM64
>> -       depends on ARCH_RENESAS || COMPILE_TEST
>>          select DRM_KMS_HELPER
>>          select DRM_GEM_DMA_HELPER
>>          select VIDEOMODE_HELPERS
>> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>> b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>> index adfb36b0e815..1142d51473e6 100644
>> --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>> +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>> @@ -386,7 +386,9 @@ struct drm_gem_object
>> *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
>>          gem_obj->funcs = &rcar_du_gem_funcs;
>>   
>>          drm_gem_private_object_init(dev, gem_obj, attach->dmabuf-
>>> size);
>> -       dma_obj->map_noncoherent = false;
>> +
>> +       dma_obj->cached = false;
>> +       dma_obj->coherent = false;
>>   
>>          ret = drm_gem_create_mmap_offset(gem_obj);
>>          if (ret) {
>> diff --git a/include/drm/drm_gem_dma_helper.h
>> b/include/drm/drm_gem_dma_helper.h
>> index 8a043235dad8..585ce3d4d1eb 100644
>> --- a/include/drm/drm_gem_dma_helper.h
>> +++ b/include/drm/drm_gem_dma_helper.h
>> @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
>>    *       more than one entry but they are guaranteed to have
>> contiguous
>>    *       DMA addresses.
>>    * @vaddr: kernel virtual address of the backing memory
>> - * @map_noncoherent: if true, the GEM object is backed by non-
>> coherent memory
>> + * @cached: if true, the GEM object is backed by cached memory
>> + * @coherent: This option only meaningful when a GEM object is
>> cached.
>> + *            If true, Sync the GEM object for DMA access is not
>> required.
>>    */
>>   struct drm_gem_dma_object {
>>          struct drm_gem_object base;
>> @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
>>          /* For objects with DMA memory allocated by GEM DMA */
>>          void *vaddr;
>>   
>> -       bool map_noncoherent;
>> +       bool cached;
>> +       bool coherent;
>>   };
>>   
>>   #define to_drm_gem_dma_obj(gem_obj) \
  
Paul Cercueil June 7, 2023, 12:09 p.m. UTC | #3
Hi Sui,

Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> Hi,
> 
> 
> On 2023/6/7 17:36, Paul Cercueil wrote:
> > Hi Sui,
> > 
> > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > The single map_noncoherent member of struct drm_gem_dma_object
> > > may
> > > not
> > > sufficient for describing the backing memory of the GEM buffer
> > > object.
> > > 
> > > Especially on dma-coherent systems, the backing memory is both
> > > cached
> > > coherent for multi-core CPUs and dma-coherent for peripheral
> > > device.
> > > Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
> > > 
> > > Whether a peripheral device is dma-coherent or not can be
> > > implementation-dependent. The single map_noncoherent option is
> > > not
> > > enough
> > > to reflect real hardware anymore. For example, the Loongson
> > > LS3A4000
> > > CPU
> > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > platform
> > > allways snoop CPU's cache. Doing the allocation with
> > > dma_alloc_coherent
> > > function is preferred. The return buffer is cached, it should not
> > > using
> > > the default write-combine mapping. While with the current
> > > implement,
> > > there
> > > no way to tell the drm core to reflect this.
> > > 
> > > This patch adds cached and coherent members to struct
> > > drm_gem_dma_object.
> > > which allow driver implements to inform the core. Introducing new
> > > mappings
> > > while keeping the original default behavior unchanged.
> > Did you try to simply set the "dma-coherent" property to the
> > device's
> > node?
> 
> But this approach can only be applied for the device driver with DT
> support.
> 
> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically do
> not 
> have DT support.
> 
> They using ACPI to pass parameter from the firmware to Linux kernel.
> 
> You approach will lost the effectiveness on such a case.

Well, I don't really know how ACPI handles it - but it should just be a
matter of setting dev->dma_coherent. That's basically what the DT code
does.

Some MIPS boards set it in their setup code for instance.

> >  From what I understand if you add that property then Linux will
> > use DMA
> > coherent memory even though you use dma_alloc_noncoherent() and the
> > sync_single_for_cpu() / sync_single_for_device() are then NOPs.
> 
> Please do not mitigate the problems with confusing method.
> 
> 
> This approach not only tend to generate confusion but also 
> implement-dependent
> 
> and arch-dependent. It's definitely problematic.
> 
> 
> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH
> specific 
> thing.
> 
> Dependent on how does the arch_dma_ops is implemented.
> 
> 
> The definition of the coherent on different ARCH has different
> meanings.
> 
> The definition of the wirte-combine on different ARCH has different 
> meanings.
> 
> 
> The wirte-combine(uncache acceleration) on mips is non dma-coherent.

It is dma-coherent on Ingenic SoCs.

> 
> But on arm, It seem that wirte-combine is coherent. (guaranteed by
> arch 
> implement).
> 
> 
> I also heard using dma_alloc_coherent  to allocation the buffer for
> the 
> non-coherent doesn't hurt, but the reverse is not true.
> 
> 
> But please do not create confusion.
> 
> software composite is faster because better cacheusing rate and
> 
> cache is faster to read.
> 
> It is faster because it is cached, not because it is non-coherent.
> 
> non-coherent is arch thing and/or driver-side thing,
> 
> it is a side effect of  using the cached mapping.

Yes, I know that.

> 
> 
> It should left to driver to handle such a side effect. The device
> driver
> 
> know their device, so its the device driver's responsibility to
> maintain
> the coherency.  On loongson platform, we don't need to call 
> drm_fb_dma_sync_non_coherent() function, Its already guaranteed by
> hardware.

I understand. What I'm saying, is that you should be able to set
dma_obj->map_noncoherent (which would arguably be better named
"map_cached", but that's a different problem). Then the GEM code would
end up calling dma_alloc_noncoherent(), which will give you *cached*
memory. Then as long as dev->dma_coherent = true,
drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't
pointlessly sync/invalidate the caches.

And I disagree with you, the driver shouldn't handle such things. The
fact that it is better to use cached memory or uncached with write-
combine really is platform-specific and not something that the driver
should be aware of.

Cheers,
-Paul

> 
> 
> > Cheers,
> > -Paul
> > 
> > > Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
> > > ---
> > >   drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
> > >   drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
> > >   drivers/gpu/drm/drm_gem_dma_helper.c      | 20
> > > ++++++++++++++++----
> > >   drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
> > >   drivers/gpu/drm/rcar-du/Kconfig           |  2 --
> > >   drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
> > >   include/drm/drm_gem_dma_helper.h          |  7 +++++--
> > >   7 files changed, 34 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
> > > b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > index 3b535ad1b07c..93ff05041192 100644
> > > --- a/drivers/gpu/drm/drm_fb_dma_helper.c
> > > +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > @@ -106,16 +106,15 @@ dma_addr_t drm_fb_dma_get_gem_addr(struct
> > > drm_framebuffer *fb,
> > >   EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
> > >   
> > >   /**
> > > - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-
> > > coherent
> > > backing
> > > - *     memory
> > > + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached
> > > backing
> > > memory
> > >    * @drm: DRM device
> > >    * @old_state: Old plane state
> > >    * @state: New plane state
> > >    *
> > >    * This function can be used by drivers that use damage clips
> > > and
> > > have
> > > - * DMA GEM objects backed by non-coherent memory. Calling this
> > > function
> > > - * in a plane's .atomic_update ensures that all the data in the
> > > backing
> > > - * memory have been written to RAM.
> > > + * DMA GEM objects backed by cached memory. Calling this
> > > function in
> > > a
> > > + * plane's .atomic_update ensures that all the data in the
> > > backing
> > > memory
> > > + * have been written to RAM.
> > >    */
> > >   void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
> > >                                    struct drm_plane_state
> > > *old_state,
> > > @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
> > > drm_device *drm,
> > >   
> > >          for (i = 0; i < finfo->num_planes; i++) {
> > >                  dma_obj = drm_fb_dma_get_gem_obj(state->fb, i);
> > > -               if (!dma_obj->map_noncoherent)
> > > +               if (dma_obj->cached && dma_obj->coherent)
> > >                          continue;
> > >   
> > >                  daddr = drm_fb_dma_get_gem_addr(state->fb,
> > > state, i);
> > > diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
> > > b/drivers/gpu/drm/drm_fbdev_dma.c
> > > index d86773fa8ab0..49fe9b284cc8 100644
> > > --- a/drivers/gpu/drm/drm_fbdev_dma.c
> > > +++ b/drivers/gpu/drm/drm_fbdev_dma.c
> > > @@ -131,7 +131,7 @@ static int
> > > drm_fbdev_dma_helper_fb_probe(struct
> > > drm_fb_helper *fb_helper,
> > >   
> > >          /* screen */
> > >          info->flags |= FBINFO_VIRTFB; /* system memory */
> > > -       if (dma_obj->map_noncoherent)
> > > +       if (dma_obj->cached)
> > >                  info->flags |= FBINFO_READS_FAST; /* signal
> > > caching
> > > */
> > >          info->screen_size = sizes->surface_height * fb-
> > > >pitches[0];
> > >          info->screen_buffer = map.vaddr;
> > > diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
> > > b/drivers/gpu/drm/drm_gem_dma_helper.c
> > > index 870b90b78bc4..dec1d512bdf1 100644
> > > --- a/drivers/gpu/drm/drm_gem_dma_helper.c
> > > +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
> > > @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device *drm,
> > > size_t size, bool private)
> > >                  drm_gem_private_object_init(drm, gem_obj, size);
> > >   
> > >                  /* Always use writecombine for dma-buf mappings
> > > */
> > > -               dma_obj->map_noncoherent = false;
> > > +               /* FIXME: This is not always true, on some dma
> > > coherent system,
> > > +                * cached mappings should be preferred over
> > > writecombine
> > > +                */
> > > +               dma_obj->cached = false;
> > > +               dma_obj->coherent = false;
> > >          } else {
> > >                  ret = drm_gem_object_init(drm, gem_obj, size);
> > >          }
> > > @@ -143,7 +147,11 @@ struct drm_gem_dma_object
> > > *drm_gem_dma_create(struct drm_device *drm,
> > >          if (IS_ERR(dma_obj))
> > >                  return dma_obj;
> > >   
> > > -       if (dma_obj->map_noncoherent) {
> > > +       if (dma_obj->cached && dma_obj->coherent) {
> > > +               dma_obj->vaddr = dma_alloc_coherent(drm->dev,
> > > size,
> > > +                                                   &dma_obj-
> > > > dma_addr,
> > > +                                                   GFP_KERNEL |
> > > __GFP_NOWARN);
> > > +       } else if (dma_obj->cached && !dma_obj->coherent) {
> > >                  dma_obj->vaddr = dma_alloc_noncoherent(drm->dev,
> > > size,
> > >                                                         &dma_obj-
> > > > dma_addr,
> > >                                                        
> > > DMA_TO_DEVICE,
> > > @@ -153,6 +161,7 @@ struct drm_gem_dma_object
> > > *drm_gem_dma_create(struct drm_device *drm,
> > >                                                &dma_obj-
> > > >dma_addr,
> > >                                                GFP_KERNEL |
> > > __GFP_NOWARN);
> > >          }
> > > +
> > >          if (!dma_obj->vaddr) {
> > >                  drm_dbg(drm, "failed to allocate buffer with
> > > size
> > > %zu\n",
> > >                           size);
> > > @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct
> > > drm_gem_dma_object
> > > *dma_obj)
> > >                          dma_buf_vunmap_unlocked(gem_obj-
> > > > import_attach->dmabuf, &map);
> > >                  drm_prime_gem_destroy(gem_obj, dma_obj->sgt);
> > >          } else if (dma_obj->vaddr) {
> > > -               if (dma_obj->map_noncoherent)
> > > +               if (dma_obj->cached && dma_obj->coherent)
> > > +                       dma_free_coherent(gem_obj->dev->dev,
> > > dma_obj-
> > > > base.size,
> > > +                                         dma_obj->vaddr,
> > > dma_obj-
> > > > dma_addr);
> > > +               else if (dma_obj->cached && !dma_obj->coherent)
> > >                          dma_free_noncoherent(gem_obj->dev->dev,
> > > dma_obj->base.size,
> > >                                               dma_obj->vaddr,
> > > dma_obj-
> > > > dma_addr,
> > >                                               DMA_TO_DEVICE);
> > > @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct
> > > drm_gem_dma_object
> > > *dma_obj, struct vm_area_struct *
> > >          vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
> > >          vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
> > >   
> > > -       if (dma_obj->map_noncoherent) {
> > > +       if (dma_obj->cached) {
> > >                  vma->vm_page_prot = vm_get_page_prot(vma-
> > > >vm_flags);
> > >   
> > >                  ret = dma_mmap_pages(dma_obj->base.dev->dev,
> > > diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > index 5ec75e9ba499..a3df2f99a757 100644
> > > --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct
> > > drm_device
> > > *drm, size_t size)
> > >          if (!obj)
> > >                  return ERR_PTR(-ENOMEM);
> > >   
> > > -       obj->map_noncoherent = priv->soc_info->map_noncoherent;
> > > +       if (priv->soc_info->map_noncoherent) {
> > > +               obj->cached = true;
> > > +               obj->coherent = false;
> > > +       }
> > >   
> > >          return &obj->base;
> > >   }
> > > diff --git a/drivers/gpu/drm/rcar-du/Kconfig
> > > b/drivers/gpu/drm/rcar-
> > > du/Kconfig
> > > index 53c356aed5d5..dddc70c08bdc 100644
> > > --- a/drivers/gpu/drm/rcar-du/Kconfig
> > > +++ b/drivers/gpu/drm/rcar-du/Kconfig
> > > @@ -2,8 +2,6 @@
> > >   config DRM_RCAR_DU
> > >          tristate "DRM Support for R-Car Display Unit"
> > >          depends on DRM && OF
> > > -       depends on ARM || ARM64
> > > -       depends on ARCH_RENESAS || COMPILE_TEST
> > >          select DRM_KMS_HELPER
> > >          select DRM_GEM_DMA_HELPER
> > >          select VIDEOMODE_HELPERS
> > > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > index adfb36b0e815..1142d51473e6 100644
> > > --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > @@ -386,7 +386,9 @@ struct drm_gem_object
> > > *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
> > >          gem_obj->funcs = &rcar_du_gem_funcs;
> > >   
> > >          drm_gem_private_object_init(dev, gem_obj, attach-
> > > >dmabuf-
> > > > size);
> > > -       dma_obj->map_noncoherent = false;
> > > +
> > > +       dma_obj->cached = false;
> > > +       dma_obj->coherent = false;
> > >   
> > >          ret = drm_gem_create_mmap_offset(gem_obj);
> > >          if (ret) {
> > > diff --git a/include/drm/drm_gem_dma_helper.h
> > > b/include/drm/drm_gem_dma_helper.h
> > > index 8a043235dad8..585ce3d4d1eb 100644
> > > --- a/include/drm/drm_gem_dma_helper.h
> > > +++ b/include/drm/drm_gem_dma_helper.h
> > > @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
> > >    *       more than one entry but they are guaranteed to have
> > > contiguous
> > >    *       DMA addresses.
> > >    * @vaddr: kernel virtual address of the backing memory
> > > - * @map_noncoherent: if true, the GEM object is backed by non-
> > > coherent memory
> > > + * @cached: if true, the GEM object is backed by cached memory
> > > + * @coherent: This option only meaningful when a GEM object is
> > > cached.
> > > + *            If true, Sync the GEM object for DMA access is not
> > > required.
> > >    */
> > >   struct drm_gem_dma_object {
> > >          struct drm_gem_object base;
> > > @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
> > >          /* For objects with DMA memory allocated by GEM DMA */
> > >          void *vaddr;
> > >   
> > > -       bool map_noncoherent;
> > > +       bool cached;
> > > +       bool coherent;
> > >   };
> > >   
> > >   #define to_drm_gem_dma_obj(gem_obj) \
>
  
Maxime Ripard June 7, 2023, 12:19 p.m. UTC | #4
On Wed, Jun 07, 2023 at 06:30:01PM +0800, Sui Jingfeng wrote:
> On 2023/6/7 17:36, Paul Cercueil wrote:
> > Hi Sui,
> > 
> > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > The single map_noncoherent member of struct drm_gem_dma_object may
> > > not
> > > sufficient for describing the backing memory of the GEM buffer
> > > object.
> > > 
> > > Especially on dma-coherent systems, the backing memory is both cached
> > > coherent for multi-core CPUs and dma-coherent for peripheral device.
> > > Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
> > > 
> > > Whether a peripheral device is dma-coherent or not can be
> > > implementation-dependent. The single map_noncoherent option is not
> > > enough
> > > to reflect real hardware anymore. For example, the Loongson LS3A4000
> > > CPU
> > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > platform
> > > allways snoop CPU's cache. Doing the allocation with
> > > dma_alloc_coherent
> > > function is preferred. The return buffer is cached, it should not
> > > using
> > > the default write-combine mapping. While with the current implement,
> > > there
> > > no way to tell the drm core to reflect this.
> > > 
> > > This patch adds cached and coherent members to struct
> > > drm_gem_dma_object.
> > > which allow driver implements to inform the core. Introducing new
> > > mappings
> > > while keeping the original default behavior unchanged.
> > Did you try to simply set the "dma-coherent" property to the device's
> > node?
> 
> But this approach can only be applied for the device driver with DT support.
>
> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically do not
> have DT support.
> 
> They using ACPI to pass parameter from the firmware to Linux kernel.
> 
> You approach will lost the effectiveness on such a case.

Not really, no. All DT support is doing is setting some generic device
parameter based on that property, but the infrastructure is very much
generic and can be used on systems without a DT.

> >  From what I understand if you add that property then Linux will use DMA
> > coherent memory even though you use dma_alloc_noncoherent() and the
> > sync_single_for_cpu() / sync_single_for_device() are then NOPs.
 >
> Please do not mitigate the problems with confusing method.

It's not a confusing method, it's one of the two main API to deal with
DMA buffers. And you might disagree with Paul but there's no need to be
rude about it.

> This approach not only tend to generate confusion but also
> implement-dependent and arch-dependent. It's definitely problematic.
> 
> 
> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH specific
> thing.
> 
> Dependent on how does the arch_dma_ops is implemented.
> 
> 
> The definition of the coherent on different ARCH has different meanings.
> 
> The definition of the wirte-combine on different ARCH has different
> meanings.
> 
> 
> The wirte-combine(uncache acceleration) on mips is non dma-coherent.

Then MIPS breaks the DMA allocation semantics. A buffer allocated with
dma_alloc_wc is supposed to be coherent.

> But on arm, It seem that wirte-combine is coherent. (guaranteed by arch
> implement).
> 
> 
> I also heard using dma_alloc_coherent  to allocation the buffer for the
> non-coherent doesn't hurt, but the reverse is not true.
> 
> 
> But please do not create confusion.
> 
> software composite is faster because better cacheusing rate and
> 
> cache is faster to read.
> 
> It is faster because it is cached, not because it is non-coherent.
> 
> non-coherent is arch thing and/or driver-side thing,
> 
> it is a side effect of  using the cached mapping.

Honestly, it's not clear to me what your point or issue is.

Going back to the description in your commit log, you mention that you
want to support multiple hardware that might or might not be DMA
coherent, and thus you want to allocate a buffer with different
attributes depending on that?

Like, you say that the LS3A4000 has a coherency unit and thus doing the
allocation with dma_alloc_coherent is preferred. Preferred to what? A WC
buffer? Why?

A WC buffer is a coherent buffer that is allowed to cache writes.

It doesn't have to, and worst case scenario you're inexactly the same
case than a dma_alloc_coherent buffer.

> It should left to driver to handle such a side effect. The device
> driver know their device, so its the device driver's responsibility to
> maintain the coherency.

Not really, no. Some driver are used across multiple SoCs and multiple
arch. It doesn't make any sense to encode this in the driver... which is
why it's in the DT in the first place, and abstracted away by the DMA
API. Like, do you really expect the amdgpu driver to know the DMA
attributes it needs to allocate a buffer from when running from a
RaspberryPi?

> On loongson platform, we don't need to call
> drm_fb_dma_sync_non_coherent() function, Its already guaranteed by
> hardware.

And mostly guaranteed by dma_alloc_coherent. And if you wanted to call
it anyway, it would be a nop if the device is declared as coherent
already.

I think you're thinking about this backward. A buffer has mapping
attributes, and a device has hardware properties.

The driver (ie, software) will allocate a buffer with some mapping
attributes, and will assume that they are met in the rest of its code.
How they are met is an implementation detail of the hardware, and for
all the driver cares, it doesn't have to match.

You can allocate a WC buffer to use on a non-coherent device and that's
fine. You can allocate a non-coherent buffer on a coherent device and
that's fine too. The DMA API will make everything work when it needs to,
and if the hardware already provides stronger guarantees, then it will
just skip whatever is redundant.

So you need to write your driver using buffer is the most convenient for
you, and it's really all that matters at the driver level. But for that
to work, you need to flag the coherence-ness of your devices properly,
like Paul suggested.

Maxime
  
Sui Jingfeng June 7, 2023, 2:38 p.m. UTC | #5
Hi,  welcome to discussion.


I have limited skills in manipulating English.

It may not express what I'm really means in the short time.

Part of word in the sentence may not as accurate as your.

Well, please don't misunderstand, I'm not doing the rude to you.

I will explain it with more details.

See below:


On 2023/6/7 20:09, Paul Cercueil wrote:
> Hi Sui,
>
> Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
>> Hi,
>>
>>
>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>> The single map_noncoherent member of struct drm_gem_dma_object
>>>> may
>>>> not
>>>> sufficient for describing the backing memory of the GEM buffer
>>>> object.
>>>>
>>>> Especially on dma-coherent systems, the backing memory is both
>>>> cached
>>>> coherent for multi-core CPUs and dma-coherent for peripheral
>>>> device.
>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
>>>>
>>>> Whether a peripheral device is dma-coherent or not can be
>>>> implementation-dependent. The single map_noncoherent option is
>>>> not
>>>> enough
>>>> to reflect real hardware anymore. For example, the Loongson
>>>> LS3A4000
>>>> CPU
>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>> platform
>>>> allways snoop CPU's cache. Doing the allocation with
>>>> dma_alloc_coherent
>>>> function is preferred. The return buffer is cached, it should not
>>>> using
>>>> the default write-combine mapping. While with the current
>>>> implement,
>>>> there
>>>> no way to tell the drm core to reflect this.
>>>>
>>>> This patch adds cached and coherent members to struct
>>>> drm_gem_dma_object.
>>>> which allow driver implements to inform the core. Introducing new
>>>> mappings
>>>> while keeping the original default behavior unchanged.
>>> Did you try to simply set the "dma-coherent" property to the
>>> device's
>>> node?
>> But this approach can only be applied for the device driver with DT
>> support.
>>
>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically do
>> not
>> have DT support.
>>
>> They using ACPI to pass parameter from the firmware to Linux kernel.
>>
>> You approach will lost the effectiveness on such a case.
> Well, I don't really know how ACPI handles it - but it should just be a
> matter of setting dev->dma_coherent. That's basically what the DT code
> does.
>
> Some MIPS boards set it in their setup code for instance.
>
This is a *strategy*, not a *mechanism*.

In this case, DT is just used to describing the hardware.

(It is actually a hardware feature describing language, the granularity 
is large)

It does not changing the state of the hardware.

It's your platform firmware or kernel setting up code who actually do 
such a things.


It's just that it works on *one* platform, it does not guarantee it will 
works on others.

While my patch is trying to create a *mechanism* which could probably

works on all platform.


It is based the patch you have already commuted.

Thanks for your excellent contribution.


>>>   From what I understand if you add that property then Linux will
>>> use DMA
>>> coherent memory even though you use dma_alloc_noncoherent() and the
>>> sync_single_for_cpu() / sync_single_for_device() are then NOPs.
>> Please do not mitigate the problems with confusing method.
>>
>>
>> This approach not only tend to generate confusion but also
>> implement-dependent
>>
>> and arch-dependent. It's definitely problematic.
>>
>>
>> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH
>> specific
>> thing.
>>
>> Dependent on how does the arch_dma_ops is implemented.
>>
>>
>> The definition of the coherent on different ARCH has different
>> meanings.
>>
>> The definition of the wirte-combine on different ARCH has different
>> meanings.
>>
>>
>> The wirte-combine(uncache acceleration) on mips is non dma-coherent.
> It is dma-coherent on Ingenic SoCs.
>
>
It is dma-coherent ? How does it achieve it?


As far as I know,  there is a write buffer within the mips cpu.

typically 64 byte,  but it is not cache. It will gather the CPU write 
access,

When a peripheral device do the DMA, how does you platform guarantee

the data in the CPU write buffer has been already arrived at (or flushed 
out to)

the system RAM?


Does the  peripheral device snoop the CPU's write buffer,

or it need manually flush the write buffer with SYNC instruction?

>> But on arm, It seem that wirte-combine is coherent. (guaranteed by
>> arch
>> implement).
>>
>>
>> I also heard using dma_alloc_coherent  to allocation the buffer for
>> the
>> non-coherent doesn't hurt, but the reverse is not true.
>>
>>
>> But please do not create confusion.
>>
>> software composite is faster because better cacheusing rate and
>>
>> cache is faster to read.
>>
>> It is faster because it is cached, not because it is non-coherent.
>>
>> non-coherent is arch thing and/or driver-side thing,
>>
>> it is a side effect of  using the cached mapping.
> Yes, I know that.
>
>>
>> It should left to driver to handle such a side effect. The device
>> driver
>>
>> know their device, so its the device driver's responsibility to
>> maintain
>> the coherency.  On loongson platform, we don't need to call
>> drm_fb_dma_sync_non_coherent() function, Its already guaranteed by
>> hardware.
> I understand. What I'm saying, is that you should be able to set
> dma_obj->map_noncoherent (which would arguably be better named
> "map_cached",

My point is that the word *cached* reflect the nature,

dma-coherent or dma-noncoherent is secondary.

We are all on the way to pursue the performance.

In the end, it is the cache give us the speed.


Why not we credit the cache hardware inside of the CPU?

> but that's a different problem). Then the GEM code would
> end up calling dma_alloc_noncoherent(), which will give you *cached*
> memory. Then as long as dev->dma_coherent = true,
> drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't
> pointlessly sync/invalidate the caches.
>
> And I disagree with you, the driver shouldn't handle such things.

You already handle the side effect of such things, See below:


```

    if (ingenic_drm_map_noncoherent(ipu->master))
         drm_fb_dma_sync_non_coherent(ipu->drm, oldstate, newstate);

```

By the way,  Ingenic is the only driver in the drivers/gpu/drm/ that 
handle such things

so far.

>   The
> fact that it is better to use cached memory or uncached with write-
> combine really is platform-specific and not something that the driver
> should be aware of.

But the fact is that,  It is drm/ingenic tell the drm core,  some SoC is 
prefer cached,

but unable to enforce the coherent. So that it need  flush the cache 
manually.

What do you meant by saying that the driver should not be aware of ?

> Cheers,
> -Paul
>
>>
>>> Cheers,
>>> -Paul
>>>
>>>> Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
>>>> ---
>>>>    drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
>>>>    drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
>>>>    drivers/gpu/drm/drm_gem_dma_helper.c      | 20
>>>> ++++++++++++++++----
>>>>    drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
>>>>    drivers/gpu/drm/rcar-du/Kconfig           |  2 --
>>>>    drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
>>>>    include/drm/drm_gem_dma_helper.h          |  7 +++++--
>>>>    7 files changed, 34 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>> b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>> index 3b535ad1b07c..93ff05041192 100644
>>>> --- a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>> +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>> @@ -106,16 +106,15 @@ dma_addr_t drm_fb_dma_get_gem_addr(struct
>>>> drm_framebuffer *fb,
>>>>    EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
>>>>    
>>>>    /**
>>>> - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-
>>>> coherent
>>>> backing
>>>> - *     memory
>>>> + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached
>>>> backing
>>>> memory
>>>>     * @drm: DRM device
>>>>     * @old_state: Old plane state
>>>>     * @state: New plane state
>>>>     *
>>>>     * This function can be used by drivers that use damage clips
>>>> and
>>>> have
>>>> - * DMA GEM objects backed by non-coherent memory. Calling this
>>>> function
>>>> - * in a plane's .atomic_update ensures that all the data in the
>>>> backing
>>>> - * memory have been written to RAM.
>>>> + * DMA GEM objects backed by cached memory. Calling this
>>>> function in
>>>> a
>>>> + * plane's .atomic_update ensures that all the data in the
>>>> backing
>>>> memory
>>>> + * have been written to RAM.
>>>>     */
>>>>    void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
>>>>                                     struct drm_plane_state
>>>> *old_state,
>>>> @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
>>>> drm_device *drm,
>>>>    
>>>>           for (i = 0; i < finfo->num_planes; i++) {
>>>>                   dma_obj = drm_fb_dma_get_gem_obj(state->fb, i);
>>>> -               if (!dma_obj->map_noncoherent)
>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>>                           continue;
>>>>    
>>>>                   daddr = drm_fb_dma_get_gem_addr(state->fb,
>>>> state, i);
>>>> diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
>>>> b/drivers/gpu/drm/drm_fbdev_dma.c
>>>> index d86773fa8ab0..49fe9b284cc8 100644
>>>> --- a/drivers/gpu/drm/drm_fbdev_dma.c
>>>> +++ b/drivers/gpu/drm/drm_fbdev_dma.c
>>>> @@ -131,7 +131,7 @@ static int
>>>> drm_fbdev_dma_helper_fb_probe(struct
>>>> drm_fb_helper *fb_helper,
>>>>    
>>>>           /* screen */
>>>>           info->flags |= FBINFO_VIRTFB; /* system memory */
>>>> -       if (dma_obj->map_noncoherent)
>>>> +       if (dma_obj->cached)
>>>>                   info->flags |= FBINFO_READS_FAST; /* signal
>>>> caching
>>>> */
>>>>           info->screen_size = sizes->surface_height * fb-
>>>>> pitches[0];
>>>>           info->screen_buffer = map.vaddr;
>>>> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>> b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>> index 870b90b78bc4..dec1d512bdf1 100644
>>>> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>> @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device *drm,
>>>> size_t size, bool private)
>>>>                   drm_gem_private_object_init(drm, gem_obj, size);
>>>>    
>>>>                   /* Always use writecombine for dma-buf mappings
>>>> */
>>>> -               dma_obj->map_noncoherent = false;
>>>> +               /* FIXME: This is not always true, on some dma
>>>> coherent system,
>>>> +                * cached mappings should be preferred over
>>>> writecombine
>>>> +                */
>>>> +               dma_obj->cached = false;
>>>> +               dma_obj->coherent = false;
>>>>           } else {
>>>>                   ret = drm_gem_object_init(drm, gem_obj, size);
>>>>           }
>>>> @@ -143,7 +147,11 @@ struct drm_gem_dma_object
>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>           if (IS_ERR(dma_obj))
>>>>                   return dma_obj;
>>>>    
>>>> -       if (dma_obj->map_noncoherent) {
>>>> +       if (dma_obj->cached && dma_obj->coherent) {
>>>> +               dma_obj->vaddr = dma_alloc_coherent(drm->dev,
>>>> size,
>>>> +                                                   &dma_obj-
>>>>> dma_addr,
>>>> +                                                   GFP_KERNEL |
>>>> __GFP_NOWARN);
>>>> +       } else if (dma_obj->cached && !dma_obj->coherent) {
>>>>                   dma_obj->vaddr = dma_alloc_noncoherent(drm->dev,
>>>> size,
>>>>                                                          &dma_obj-
>>>>> dma_addr,
>>>>                                                         
>>>> DMA_TO_DEVICE,
>>>> @@ -153,6 +161,7 @@ struct drm_gem_dma_object
>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>                                                 &dma_obj-
>>>>> dma_addr,
>>>>                                                 GFP_KERNEL |
>>>> __GFP_NOWARN);
>>>>           }
>>>> +
>>>>           if (!dma_obj->vaddr) {
>>>>                   drm_dbg(drm, "failed to allocate buffer with
>>>> size
>>>> %zu\n",
>>>>                            size);
>>>> @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct
>>>> drm_gem_dma_object
>>>> *dma_obj)
>>>>                           dma_buf_vunmap_unlocked(gem_obj-
>>>>> import_attach->dmabuf, &map);
>>>>                   drm_prime_gem_destroy(gem_obj, dma_obj->sgt);
>>>>           } else if (dma_obj->vaddr) {
>>>> -               if (dma_obj->map_noncoherent)
>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>> +                       dma_free_coherent(gem_obj->dev->dev,
>>>> dma_obj-
>>>>> base.size,
>>>> +                                         dma_obj->vaddr,
>>>> dma_obj-
>>>>> dma_addr);
>>>> +               else if (dma_obj->cached && !dma_obj->coherent)
>>>>                           dma_free_noncoherent(gem_obj->dev->dev,
>>>> dma_obj->base.size,
>>>>                                                dma_obj->vaddr,
>>>> dma_obj-
>>>>> dma_addr,
>>>>                                                DMA_TO_DEVICE);
>>>> @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct
>>>> drm_gem_dma_object
>>>> *dma_obj, struct vm_area_struct *
>>>>           vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
>>>>           vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
>>>>    
>>>> -       if (dma_obj->map_noncoherent) {
>>>> +       if (dma_obj->cached) {
>>>>                   vma->vm_page_prot = vm_get_page_prot(vma-
>>>>> vm_flags);
>>>>    
>>>>                   ret = dma_mmap_pages(dma_obj->base.dev->dev,
>>>> diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>> b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>> index 5ec75e9ba499..a3df2f99a757 100644
>>>> --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>> +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>> @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct
>>>> drm_device
>>>> *drm, size_t size)
>>>>           if (!obj)
>>>>                   return ERR_PTR(-ENOMEM);
>>>>    
>>>> -       obj->map_noncoherent = priv->soc_info->map_noncoherent;
>>>> +       if (priv->soc_info->map_noncoherent) {
>>>> +               obj->cached = true;
>>>> +               obj->coherent = false;
>>>> +       }
>>>>    
>>>>           return &obj->base;
>>>>    }
>>>> diff --git a/drivers/gpu/drm/rcar-du/Kconfig
>>>> b/drivers/gpu/drm/rcar-
>>>> du/Kconfig
>>>> index 53c356aed5d5..dddc70c08bdc 100644
>>>> --- a/drivers/gpu/drm/rcar-du/Kconfig
>>>> +++ b/drivers/gpu/drm/rcar-du/Kconfig
>>>> @@ -2,8 +2,6 @@
>>>>    config DRM_RCAR_DU
>>>>           tristate "DRM Support for R-Car Display Unit"
>>>>           depends on DRM && OF
>>>> -       depends on ARM || ARM64
>>>> -       depends on ARCH_RENESAS || COMPILE_TEST
>>>>           select DRM_KMS_HELPER
>>>>           select DRM_GEM_DMA_HELPER
>>>>           select VIDEOMODE_HELPERS
>>>> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>> b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>> index adfb36b0e815..1142d51473e6 100644
>>>> --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>> +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>> @@ -386,7 +386,9 @@ struct drm_gem_object
>>>> *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
>>>>           gem_obj->funcs = &rcar_du_gem_funcs;
>>>>    
>>>>           drm_gem_private_object_init(dev, gem_obj, attach-
>>>>> dmabuf-
>>>>> size);
>>>> -       dma_obj->map_noncoherent = false;
>>>> +
>>>> +       dma_obj->cached = false;
>>>> +       dma_obj->coherent = false;
>>>>    
>>>>           ret = drm_gem_create_mmap_offset(gem_obj);
>>>>           if (ret) {
>>>> diff --git a/include/drm/drm_gem_dma_helper.h
>>>> b/include/drm/drm_gem_dma_helper.h
>>>> index 8a043235dad8..585ce3d4d1eb 100644
>>>> --- a/include/drm/drm_gem_dma_helper.h
>>>> +++ b/include/drm/drm_gem_dma_helper.h
>>>> @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
>>>>     *       more than one entry but they are guaranteed to have
>>>> contiguous
>>>>     *       DMA addresses.
>>>>     * @vaddr: kernel virtual address of the backing memory
>>>> - * @map_noncoherent: if true, the GEM object is backed by non-
>>>> coherent memory
>>>> + * @cached: if true, the GEM object is backed by cached memory
>>>> + * @coherent: This option only meaningful when a GEM object is
>>>> cached.
>>>> + *            If true, Sync the GEM object for DMA access is not
>>>> required.
>>>>     */
>>>>    struct drm_gem_dma_object {
>>>>           struct drm_gem_object base;
>>>> @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
>>>>           /* For objects with DMA memory allocated by GEM DMA */
>>>>           void *vaddr;
>>>>    
>>>> -       bool map_noncoherent;
>>>> +       bool cached;
>>>> +       bool coherent;
>>>>    };
>>>>    
>>>>    #define to_drm_gem_dma_obj(gem_obj) \
  
Paul Cercueil June 7, 2023, 4:12 p.m. UTC | #6
Hi Sui,

Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> Hi,  welcome to discussion.
> 
> 
> I have limited skills in manipulating English.
> 
> It may not express what I'm really means in the short time.
> 
> Part of word in the sentence may not as accurate as your.
> 
> Well, please don't misunderstand, I'm not doing the rude to you.

No problem.

> 
> I will explain it with more details.
> 
> See below:
> 
> 
> On 2023/6/7 20:09, Paul Cercueil wrote:
> > Hi Sui,
> > 
> > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > Hi,
> > > 
> > > 
> > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > The single map_noncoherent member of struct
> > > > > drm_gem_dma_object
> > > > > may
> > > > > not
> > > > > sufficient for describing the backing memory of the GEM
> > > > > buffer
> > > > > object.
> > > > > 
> > > > > Especially on dma-coherent systems, the backing memory is
> > > > > both
> > > > > cached
> > > > > coherent for multi-core CPUs and dma-coherent for peripheral
> > > > > device.
> > > > > Say architectures like X86-64, LoongArch64, Loongson Mips64,
> > > > > etc.
> > > > > 
> > > > > Whether a peripheral device is dma-coherent or not can be
> > > > > implementation-dependent. The single map_noncoherent option
> > > > > is
> > > > > not
> > > > > enough
> > > > > to reflect real hardware anymore. For example, the Loongson
> > > > > LS3A4000
> > > > > CPU
> > > > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > > > platform
> > > > > allways snoop CPU's cache. Doing the allocation with
> > > > > dma_alloc_coherent
> > > > > function is preferred. The return buffer is cached, it should
> > > > > not
> > > > > using
> > > > > the default write-combine mapping. While with the current
> > > > > implement,
> > > > > there
> > > > > no way to tell the drm core to reflect this.
> > > > > 
> > > > > This patch adds cached and coherent members to struct
> > > > > drm_gem_dma_object.
> > > > > which allow driver implements to inform the core. Introducing
> > > > > new
> > > > > mappings
> > > > > while keeping the original default behavior unchanged.
> > > > Did you try to simply set the "dma-coherent" property to the
> > > > device's
> > > > node?
> > > But this approach can only be applied for the device driver with
> > > DT
> > > support.
> > > 
> > > X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
> > > do
> > > not
> > > have DT support.
> > > 
> > > They using ACPI to pass parameter from the firmware to Linux
> > > kernel.
> > > 
> > > You approach will lost the effectiveness on such a case.
> > Well, I don't really know how ACPI handles it - but it should just
> > be a
> > matter of setting dev->dma_coherent. That's basically what the DT
> > code
> > does.
> > 
> > Some MIPS boards set it in their setup code for instance.
> > 
> This is a *strategy*, not a *mechanism*.
> 
> In this case, DT is just used to describing the hardware.
> 
> (It is actually a hardware feature describing language, the
> granularity 
> is large)
> 
> It does not changing the state of the hardware.
> 
> It's your platform firmware or kernel setting up code who actually do
> such a things.
> 
> 
> It's just that it works on *one* platform, it does not guarantee it
> will 
> works on others.

If you add the "dma-coherent" property in a device node in DT, you
effectively specify that the device is DMA-coherent; so you describe
the hardware, which is what DT is for, and you are not changing the
state of the hardware.

Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
default to DMA-coherent mapping; I believe you could do something
similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.


> While my patch is trying to create a *mechanism* which could probably
> 
> works on all platform.
> 
> 
> It is based the patch you have already commuted.
> 
> Thanks for your excellent contribution.
> 
> 
> > > >   From what I understand if you add that property then Linux
> > > > will
> > > > use DMA
> > > > coherent memory even though you use dma_alloc_noncoherent() and
> > > > the
> > > > sync_single_for_cpu() / sync_single_for_device() are then NOPs.
> > > Please do not mitigate the problems with confusing method.
> > > 
> > > 
> > > This approach not only tend to generate confusion but also
> > > implement-dependent
> > > 
> > > and arch-dependent. It's definitely problematic.
> > > 
> > > 
> > > How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH
> > > specific
> > > thing.
> > > 
> > > Dependent on how does the arch_dma_ops is implemented.
> > > 
> > > 
> > > The definition of the coherent on different ARCH has different
> > > meanings.
> > > 
> > > The definition of the wirte-combine on different ARCH has
> > > different
> > > meanings.
> > > 
> > > 
> > > The wirte-combine(uncache acceleration) on mips is non dma-
> > > coherent.
> > It is dma-coherent on Ingenic SoCs.
> > 
> > 
> It is dma-coherent ? How does it achieve it?
> 
> 
> As far as I know,  there is a write buffer within the mips cpu.
> 
> typically 64 byte,  but it is not cache. It will gather the CPU write
> access,
> 
> When a peripheral device do the DMA, how does you platform guarantee
> 
> the data in the CPU write buffer has been already arrived at (or
> flushed 
> out to)
> 
> the system RAM?
> 
> 
> Does the  peripheral device snoop the CPU's write buffer,
> 
> or it need manually flush the write buffer with SYNC instruction?

I believe the DMA flushes the write buffer? I don't actually know the
details, it would be something to ask to Ingenic.

> 
> > > But on arm, It seem that wirte-combine is coherent. (guaranteed
> > > by
> > > arch
> > > implement).
> > > 
> > > 
> > > I also heard using dma_alloc_coherent  to allocation the buffer
> > > for
> > > the
> > > non-coherent doesn't hurt, but the reverse is not true.
> > > 
> > > 
> > > But please do not create confusion.
> > > 
> > > software composite is faster because better cacheusing rate and
> > > 
> > > cache is faster to read.
> > > 
> > > It is faster because it is cached, not because it is non-
> > > coherent.
> > > 
> > > non-coherent is arch thing and/or driver-side thing,
> > > 
> > > it is a side effect of  using the cached mapping.
> > Yes, I know that.
> > 
> > > 
> > > It should left to driver to handle such a side effect. The device
> > > driver
> > > 
> > > know their device, so its the device driver's responsibility to
> > > maintain
> > > the coherency.  On loongson platform, we don't need to call
> > > drm_fb_dma_sync_non_coherent() function, Its already guaranteed
> > > by
> > > hardware.
> > I understand. What I'm saying, is that you should be able to set
> > dma_obj->map_noncoherent (which would arguably be better named
> > "map_cached",
> 
> My point is that the word *cached* reflect the nature,
> 
> dma-coherent or dma-noncoherent is secondary.
> 
> We are all on the way to pursue the performance.
> 
> In the end, it is the cache give us the speed.
> 
> 
> Why not we credit the cache hardware inside of the CPU?

dma_alloc_noncoherent() gives you *cached* memory.

Therefore, if you want *cached* memory, you should set
gem->map_noncoherent.

I understand your confusion; it would be easier to understand if this
function was called dma_alloc_cached().

Then, if the memory is actually DMA-coherent for the device (dev-
>dma_coherent == true), the drm_fb_dma_sync_non_coherent() function is
a no-op.

But in both cases (DMA-coherent device, non DMA-coherent device), if
you want cached buffers, you should call dma_alloc_noncoherent().


> > but that's a different problem). Then the GEM code would
> > end up calling dma_alloc_noncoherent(), which will give you
> > *cached*
> > memory. Then as long as dev->dma_coherent = true,
> > drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't
> > pointlessly sync/invalidate the caches.
> > 
> > And I disagree with you, the driver shouldn't handle such things.
> 
> You already handle the side effect of such things, See below:
> 
> 
> ```
> 
>     if (ingenic_drm_map_noncoherent(ipu->master))
>          drm_fb_dma_sync_non_coherent(ipu->drm, oldstate, newstate);
> 
> ```
> 
> By the way,  Ingenic is the only driver in the drivers/gpu/drm/ that 
> handle such things
> 
> so far.

Yes; and now I think that this was a bad idea (for the reasons Maxime
listed in his email).

> 
> >   The
> > fact that it is better to use cached memory or uncached with write-
> > combine really is platform-specific and not something that the
> > driver
> > should be aware of.
> 
> But the fact is that,  It is drm/ingenic tell the drm core,  some SoC
> is 
> prefer cached,
> 
> but unable to enforce the coherent. So that it need  flush the cache 
> manually.
> 
> What do you meant by saying that the driver should not be aware of ?

Ideally, the driver should just call a function "dma_alloc_buffer",
which would return cached memory when it makes sense, otherwise a
uncached buffer with the write-combine attribute.

Then the arch code (or DT) can decide what's the best setting, and not
the driver.

In the meantime, you should use gem->dma_noncoherent like the ingenic-
drm driver does - until somebody (probably me) refactor things.

Cheers,
-Paul

> 
> > Cheers,
> > -Paul
> > 
> > > 
> > > > Cheers,
> > > > -Paul
> > > > 
> > > > > Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
> > > > > ---
> > > > >    drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
> > > > >    drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
> > > > >    drivers/gpu/drm/drm_gem_dma_helper.c      | 20
> > > > > ++++++++++++++++----
> > > > >    drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
> > > > >    drivers/gpu/drm/rcar-du/Kconfig           |  2 --
> > > > >    drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
> > > > >    include/drm/drm_gem_dma_helper.h          |  7 +++++--
> > > > >    7 files changed, 34 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
> > > > > b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > > > index 3b535ad1b07c..93ff05041192 100644
> > > > > --- a/drivers/gpu/drm/drm_fb_dma_helper.c
> > > > > +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > > > @@ -106,16 +106,15 @@ dma_addr_t
> > > > > drm_fb_dma_get_gem_addr(struct
> > > > > drm_framebuffer *fb,
> > > > >    EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
> > > > >    
> > > > >    /**
> > > > > - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-
> > > > > coherent
> > > > > backing
> > > > > - *     memory
> > > > > + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached
> > > > > backing
> > > > > memory
> > > > >     * @drm: DRM device
> > > > >     * @old_state: Old plane state
> > > > >     * @state: New plane state
> > > > >     *
> > > > >     * This function can be used by drivers that use damage
> > > > > clips
> > > > > and
> > > > > have
> > > > > - * DMA GEM objects backed by non-coherent memory. Calling
> > > > > this
> > > > > function
> > > > > - * in a plane's .atomic_update ensures that all the data in
> > > > > the
> > > > > backing
> > > > > - * memory have been written to RAM.
> > > > > + * DMA GEM objects backed by cached memory. Calling this
> > > > > function in
> > > > > a
> > > > > + * plane's .atomic_update ensures that all the data in the
> > > > > backing
> > > > > memory
> > > > > + * have been written to RAM.
> > > > >     */
> > > > >    void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
> > > > >                                     struct drm_plane_state
> > > > > *old_state,
> > > > > @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
> > > > > drm_device *drm,
> > > > >    
> > > > >           for (i = 0; i < finfo->num_planes; i++) {
> > > > >                   dma_obj = drm_fb_dma_get_gem_obj(state->fb,
> > > > > i);
> > > > > -               if (!dma_obj->map_noncoherent)
> > > > > +               if (dma_obj->cached && dma_obj->coherent)
> > > > >                           continue;
> > > > >    
> > > > >                   daddr = drm_fb_dma_get_gem_addr(state->fb,
> > > > > state, i);
> > > > > diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
> > > > > b/drivers/gpu/drm/drm_fbdev_dma.c
> > > > > index d86773fa8ab0..49fe9b284cc8 100644
> > > > > --- a/drivers/gpu/drm/drm_fbdev_dma.c
> > > > > +++ b/drivers/gpu/drm/drm_fbdev_dma.c
> > > > > @@ -131,7 +131,7 @@ static int
> > > > > drm_fbdev_dma_helper_fb_probe(struct
> > > > > drm_fb_helper *fb_helper,
> > > > >    
> > > > >           /* screen */
> > > > >           info->flags |= FBINFO_VIRTFB; /* system memory */
> > > > > -       if (dma_obj->map_noncoherent)
> > > > > +       if (dma_obj->cached)
> > > > >                   info->flags |= FBINFO_READS_FAST; /* signal
> > > > > caching
> > > > > */
> > > > >           info->screen_size = sizes->surface_height * fb-
> > > > > > pitches[0];
> > > > >           info->screen_buffer = map.vaddr;
> > > > > diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
> > > > > b/drivers/gpu/drm/drm_gem_dma_helper.c
> > > > > index 870b90b78bc4..dec1d512bdf1 100644
> > > > > --- a/drivers/gpu/drm/drm_gem_dma_helper.c
> > > > > +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
> > > > > @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device
> > > > > *drm,
> > > > > size_t size, bool private)
> > > > >                   drm_gem_private_object_init(drm, gem_obj,
> > > > > size);
> > > > >    
> > > > >                   /* Always use writecombine for dma-buf
> > > > > mappings
> > > > > */
> > > > > -               dma_obj->map_noncoherent = false;
> > > > > +               /* FIXME: This is not always true, on some
> > > > > dma
> > > > > coherent system,
> > > > > +                * cached mappings should be preferred over
> > > > > writecombine
> > > > > +                */
> > > > > +               dma_obj->cached = false;
> > > > > +               dma_obj->coherent = false;
> > > > >           } else {
> > > > >                   ret = drm_gem_object_init(drm, gem_obj,
> > > > > size);
> > > > >           }
> > > > > @@ -143,7 +147,11 @@ struct drm_gem_dma_object
> > > > > *drm_gem_dma_create(struct drm_device *drm,
> > > > >           if (IS_ERR(dma_obj))
> > > > >                   return dma_obj;
> > > > >    
> > > > > -       if (dma_obj->map_noncoherent) {
> > > > > +       if (dma_obj->cached && dma_obj->coherent) {
> > > > > +               dma_obj->vaddr = dma_alloc_coherent(drm->dev,
> > > > > size,
> > > > > +                                                   &dma_obj-
> > > > > > dma_addr,
> > > > > +                                                  
> > > > > GFP_KERNEL |
> > > > > __GFP_NOWARN);
> > > > > +       } else if (dma_obj->cached && !dma_obj->coherent) {
> > > > >                   dma_obj->vaddr = dma_alloc_noncoherent(drm-
> > > > > >dev,
> > > > > size,
> > > > >                                                         
> > > > > &dma_obj-
> > > > > > dma_addr,
> > > > >                                                         
> > > > > DMA_TO_DEVICE,
> > > > > @@ -153,6 +161,7 @@ struct drm_gem_dma_object
> > > > > *drm_gem_dma_create(struct drm_device *drm,
> > > > >                                                 &dma_obj-
> > > > > > dma_addr,
> > > > >                                                 GFP_KERNEL |
> > > > > __GFP_NOWARN);
> > > > >           }
> > > > > +
> > > > >           if (!dma_obj->vaddr) {
> > > > >                   drm_dbg(drm, "failed to allocate buffer
> > > > > with
> > > > > size
> > > > > %zu\n",
> > > > >                            size);
> > > > > @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct
> > > > > drm_gem_dma_object
> > > > > *dma_obj)
> > > > >                           dma_buf_vunmap_unlocked(gem_obj-
> > > > > > import_attach->dmabuf, &map);
> > > > >                   drm_prime_gem_destroy(gem_obj, dma_obj-
> > > > > >sgt);
> > > > >           } else if (dma_obj->vaddr) {
> > > > > -               if (dma_obj->map_noncoherent)
> > > > > +               if (dma_obj->cached && dma_obj->coherent)
> > > > > +                       dma_free_coherent(gem_obj->dev->dev,
> > > > > dma_obj-
> > > > > > base.size,
> > > > > +                                         dma_obj->vaddr,
> > > > > dma_obj-
> > > > > > dma_addr);
> > > > > +               else if (dma_obj->cached && !dma_obj-
> > > > > >coherent)
> > > > >                           dma_free_noncoherent(gem_obj->dev-
> > > > > >dev,
> > > > > dma_obj->base.size,
> > > > >                                                dma_obj-
> > > > > >vaddr,
> > > > > dma_obj-
> > > > > > dma_addr,
> > > > >                                               
> > > > > DMA_TO_DEVICE);
> > > > > @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct
> > > > > drm_gem_dma_object
> > > > > *dma_obj, struct vm_area_struct *
> > > > >           vma->vm_pgoff -= drm_vma_node_start(&obj-
> > > > > >vma_node);
> > > > >           vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
> > > > >    
> > > > > -       if (dma_obj->map_noncoherent) {
> > > > > +       if (dma_obj->cached) {
> > > > >                   vma->vm_page_prot = vm_get_page_prot(vma-
> > > > > > vm_flags);
> > > > >    
> > > > >                   ret = dma_mmap_pages(dma_obj->base.dev-
> > > > > >dev,
> > > > > diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > > > b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > > > index 5ec75e9ba499..a3df2f99a757 100644
> > > > > --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > > > +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
> > > > > @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct
> > > > > drm_device
> > > > > *drm, size_t size)
> > > > >           if (!obj)
> > > > >                   return ERR_PTR(-ENOMEM);
> > > > >    
> > > > > -       obj->map_noncoherent = priv->soc_info-
> > > > > >map_noncoherent;
> > > > > +       if (priv->soc_info->map_noncoherent) {
> > > > > +               obj->cached = true;
> > > > > +               obj->coherent = false;
> > > > > +       }
> > > > >    
> > > > >           return &obj->base;
> > > > >    }
> > > > > diff --git a/drivers/gpu/drm/rcar-du/Kconfig
> > > > > b/drivers/gpu/drm/rcar-
> > > > > du/Kconfig
> > > > > index 53c356aed5d5..dddc70c08bdc 100644
> > > > > --- a/drivers/gpu/drm/rcar-du/Kconfig
> > > > > +++ b/drivers/gpu/drm/rcar-du/Kconfig
> > > > > @@ -2,8 +2,6 @@
> > > > >    config DRM_RCAR_DU
> > > > >           tristate "DRM Support for R-Car Display Unit"
> > > > >           depends on DRM && OF
> > > > > -       depends on ARM || ARM64
> > > > > -       depends on ARCH_RENESAS || COMPILE_TEST
> > > > >           select DRM_KMS_HELPER
> > > > >           select DRM_GEM_DMA_HELPER
> > > > >           select VIDEOMODE_HELPERS
> > > > > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > > > b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > > > index adfb36b0e815..1142d51473e6 100644
> > > > > --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > > > +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
> > > > > @@ -386,7 +386,9 @@ struct drm_gem_object
> > > > > *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
> > > > >           gem_obj->funcs = &rcar_du_gem_funcs;
> > > > >    
> > > > >           drm_gem_private_object_init(dev, gem_obj, attach-
> > > > > > dmabuf-
> > > > > > size);
> > > > > -       dma_obj->map_noncoherent = false;
> > > > > +
> > > > > +       dma_obj->cached = false;
> > > > > +       dma_obj->coherent = false;
> > > > >    
> > > > >           ret = drm_gem_create_mmap_offset(gem_obj);
> > > > >           if (ret) {
> > > > > diff --git a/include/drm/drm_gem_dma_helper.h
> > > > > b/include/drm/drm_gem_dma_helper.h
> > > > > index 8a043235dad8..585ce3d4d1eb 100644
> > > > > --- a/include/drm/drm_gem_dma_helper.h
> > > > > +++ b/include/drm/drm_gem_dma_helper.h
> > > > > @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
> > > > >     *       more than one entry but they are guaranteed to
> > > > > have
> > > > > contiguous
> > > > >     *       DMA addresses.
> > > > >     * @vaddr: kernel virtual address of the backing memory
> > > > > - * @map_noncoherent: if true, the GEM object is backed by
> > > > > non-
> > > > > coherent memory
> > > > > + * @cached: if true, the GEM object is backed by cached
> > > > > memory
> > > > > + * @coherent: This option only meaningful when a GEM object
> > > > > is
> > > > > cached.
> > > > > + *            If true, Sync the GEM object for DMA access is
> > > > > not
> > > > > required.
> > > > >     */
> > > > >    struct drm_gem_dma_object {
> > > > >           struct drm_gem_object base;
> > > > > @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
> > > > >           /* For objects with DMA memory allocated by GEM DMA
> > > > > */
> > > > >           void *vaddr;
> > > > >    
> > > > > -       bool map_noncoherent;
> > > > > +       bool cached;
> > > > > +       bool coherent;
> > > > >    };
> > > > >    
> > > > >    #define to_drm_gem_dma_obj(gem_obj) \
>
  
Sui Jingfeng June 7, 2023, 5:18 p.m. UTC | #7
Hi,

On 2023/6/8 00:12, Paul Cercueil wrote:
> Hi Sui,
>
> Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
>> Hi,  welcome to discussion.
>>
>>
>> I have limited skills in manipulating English.
>>
>> It may not express what I'm really means in the short time.
>>
>> Part of word in the sentence may not as accurate as your.
>>
>> Well, please don't misunderstand, I'm not doing the rude to you.
> No problem.
>
>> I will explain it with more details.
>>
>> See below:
>>
>>
>> On 2023/6/7 20:09, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
>>>> Hi,
>>>>
>>>>
>>>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>>>> Hi Sui,
>>>>>
>>>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>>>> The single map_noncoherent member of struct
>>>>>> drm_gem_dma_object
>>>>>> may
>>>>>> not
>>>>>> sufficient for describing the backing memory of the GEM
>>>>>> buffer
>>>>>> object.
>>>>>>
>>>>>> Especially on dma-coherent systems, the backing memory is
>>>>>> both
>>>>>> cached
>>>>>> coherent for multi-core CPUs and dma-coherent for peripheral
>>>>>> device.
>>>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64,
>>>>>> etc.
>>>>>>
>>>>>> Whether a peripheral device is dma-coherent or not can be
>>>>>> implementation-dependent. The single map_noncoherent option
>>>>>> is
>>>>>> not
>>>>>> enough
>>>>>> to reflect real hardware anymore. For example, the Loongson
>>>>>> LS3A4000
>>>>>> CPU
>>>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>>>> platform
>>>>>> allways snoop CPU's cache. Doing the allocation with
>>>>>> dma_alloc_coherent
>>>>>> function is preferred. The return buffer is cached, it should
>>>>>> not
>>>>>> using
>>>>>> the default write-combine mapping. While with the current
>>>>>> implement,
>>>>>> there
>>>>>> no way to tell the drm core to reflect this.
>>>>>>
>>>>>> This patch adds cached and coherent members to struct
>>>>>> drm_gem_dma_object.
>>>>>> which allow driver implements to inform the core. Introducing
>>>>>> new
>>>>>> mappings
>>>>>> while keeping the original default behavior unchanged.
>>>>> Did you try to simply set the "dma-coherent" property to the
>>>>> device's
>>>>> node?
>>>> But this approach can only be applied for the device driver with
>>>> DT
>>>> support.
>>>>
>>>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
>>>> do
>>>> not
>>>> have DT support.
>>>>
>>>> They using ACPI to pass parameter from the firmware to Linux
>>>> kernel.
>>>>
>>>> You approach will lost the effectiveness on such a case.
>>> Well, I don't really know how ACPI handles it - but it should just
>>> be a
>>> matter of setting dev->dma_coherent. That's basically what the DT
>>> code
>>> does.
>>>
>>> Some MIPS boards set it in their setup code for instance.
>>>
>> This is a *strategy*, not a *mechanism*.
>>
>> In this case, DT is just used to describing the hardware.
>>
>> (It is actually a hardware feature describing language, the
>> granularity
>> is large)
>>
>> It does not changing the state of the hardware.
>>
>> It's your platform firmware or kernel setting up code who actually do
>> such a things.
>>
>>
>> It's just that it works on *one* platform, it does not guarantee it
>> will
>> works on others.
> If you add the "dma-coherent" property in a device node in DT, you
> effectively specify that the device is DMA-coherent; so you describe
> the hardware, which is what DT is for, and you are not changing the
> state of the hardware.
>
> Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> default to DMA-coherent mapping; I believe you could do something
> similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
>
The preblem is that device driver can have various demand.

It probably want to create different kind of buffers for different thing 
simultaneously.

Say, one allocated with dma_alloc_coherent for command buffer or dma 
descriptor

another one allocated with  dma_alloc_wc for uploading shader etc.

also has the third one allocated with dma_alloc_noncoherent() for doing 
some else.


Simple setting by DT or firmware which override all allocation is not 
what we want.

  My patch is toward the drm core, leave the choice to the device drivers.


How does the device driver fetch hardware descriptions is the device 
driver's thing.

either via DT, or ACPI, kernel cmd or hard-code.  Its device drivers policy.


My patch do not require the platform make the decision for the device 
driver.

Nor does it depend on DT. Your approaches are neither sufficient nor 
necessary.


It gives the freedom to the the device driver.

Device driver has full control over the buffer allocation.

For our hardware, It don't use DT on some application scene.

Out hardware is dma-coherent and cached coherent.

We don't want a dma-coherent buffer attached with the name of 
"map_noncoherent".


>> While my patch is trying to create a *mechanism* which could probably
>>
>> works on all platform.
>>
>>
>> It is based the patch you have already commuted.
>>
>> Thanks for your excellent contribution.
>>
>>
>>>>>    From what I understand if you add that property then Linux
>>>>> will
>>>>> use DMA
>>>>> coherent memory even though you use dma_alloc_noncoherent() and
>>>>> the
>>>>> sync_single_for_cpu() / sync_single_for_device() are then NOPs.
>>>> Please do not mitigate the problems with confusing method.
>>>>
>>>>
>>>> This approach not only tend to generate confusion but also
>>>> implement-dependent
>>>>
>>>> and arch-dependent. It's definitely problematic.
>>>>
>>>>
>>>> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH
>>>> specific
>>>> thing.
>>>>
>>>> Dependent on how does the arch_dma_ops is implemented.
>>>>
>>>>
>>>> The definition of the coherent on different ARCH has different
>>>> meanings.
>>>>
>>>> The definition of the wirte-combine on different ARCH has
>>>> different
>>>> meanings.
>>>>
>>>>
>>>> The wirte-combine(uncache acceleration) on mips is non dma-
>>>> coherent.
>>> It is dma-coherent on Ingenic SoCs.
>>>
>>>
>> It is dma-coherent ? How does it achieve it?
>>
>>
>> As far as I know,  there is a write buffer within the mips cpu.
>>
>> typically 64 byte,  but it is not cache. It will gather the CPU write
>> access,
>>
>> When a peripheral device do the DMA, how does you platform guarantee
>>
>> the data in the CPU write buffer has been already arrived at (or
>> flushed
>> out to)
>>
>> the system RAM?
>>
>>
>> Does the  peripheral device snoop the CPU's write buffer,
>>
>> or it need manually flush the write buffer with SYNC instruction?
> I believe the DMA flushes the write buffer? I don't actually know the
> details, it would be something to ask to Ingenic.
>
>>>> But on arm, It seem that wirte-combine is coherent. (guaranteed
>>>> by
>>>> arch
>>>> implement).
>>>>
>>>>
>>>> I also heard using dma_alloc_coherent  to allocation the buffer
>>>> for
>>>> the
>>>> non-coherent doesn't hurt, but the reverse is not true.
>>>>
>>>>
>>>> But please do not create confusion.
>>>>
>>>> software composite is faster because better cacheusing rate and
>>>>
>>>> cache is faster to read.
>>>>
>>>> It is faster because it is cached, not because it is non-
>>>> coherent.
>>>>
>>>> non-coherent is arch thing and/or driver-side thing,
>>>>
>>>> it is a side effect of  using the cached mapping.
>>> Yes, I know that.
>>>
>>>> It should left to driver to handle such a side effect. The device
>>>> driver
>>>>
>>>> know their device, so its the device driver's responsibility to
>>>> maintain
>>>> the coherency.  On loongson platform, we don't need to call
>>>> drm_fb_dma_sync_non_coherent() function, Its already guaranteed
>>>> by
>>>> hardware.
>>> I understand. What I'm saying, is that you should be able to set
>>> dma_obj->map_noncoherent (which would arguably be better named
>>> "map_cached",
>> My point is that the word *cached* reflect the nature,
>>
>> dma-coherent or dma-noncoherent is secondary.
>>
>> We are all on the way to pursue the performance.
>>
>> In the end, it is the cache give us the speed.
>>
>>
>> Why not we credit the cache hardware inside of the CPU?
> dma_alloc_noncoherent() gives you *cached* memory.
>
> Therefore, if you want *cached* memory, you should set
> gem->map_noncoherent.
>
> I understand your confusion; it would be easier to understand if this
> function was called dma_alloc_cached().
>
> Then, if the memory is actually DMA-coherent for the device (dev-
>> dma_coherent == true), the drm_fb_dma_sync_non_coherent() function is
> a no-op.
>
> But in both cases (DMA-coherent device, non DMA-coherent device), if
> you want cached buffers, you should call dma_alloc_noncoherent().
>
>
>>> but that's a different problem). Then the GEM code would
>>> end up calling dma_alloc_noncoherent(), which will give you
>>> *cached*
>>> memory. Then as long as dev->dma_coherent = true,
>>> drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't
>>> pointlessly sync/invalidate the caches.
>>>
>>> And I disagree with you, the driver shouldn't handle such things.
>> You already handle the side effect of such things, See below:
>>
>>
>> ```
>>
>>      if (ingenic_drm_map_noncoherent(ipu->master))
>>           drm_fb_dma_sync_non_coherent(ipu->drm, oldstate, newstate);
>>
>> ```
>>
>> By the way,  Ingenic is the only driver in the drivers/gpu/drm/ that
>> handle such things
>>
>> so far.
> Yes; and now I think that this was a bad idea (for the reasons Maxime
> listed in his email).
>
>>>    The
>>> fact that it is better to use cached memory or uncached with write-
>>> combine really is platform-specific and not something that the
>>> driver
>>> should be aware of.
>> But the fact is that,  It is drm/ingenic tell the drm core,  some SoC
>> is
>> prefer cached,
>>
>> but unable to enforce the coherent. So that it need  flush the cache
>> manually.
>>
>> What do you meant by saying that the driver should not be aware of ?
> Ideally, the driver should just call a function "dma_alloc_buffer",
> which would return cached memory when it makes sense, otherwise a
> uncached buffer with the write-combine attribute.
>
> Then the arch code (or DT) can decide what's the best setting, and not
> the driver.
>
> In the meantime, you should use gem->dma_noncoherent like the ingenic-
> drm driver does - until somebody (probably me) refactor things.
>
> Cheers,
> -Paul
>
>>> Cheers,
>>> -Paul
>>>
>>>>> Cheers,
>>>>> -Paul
>>>>>
>>>>>> Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
>>>>>> ---
>>>>>>     drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
>>>>>>     drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
>>>>>>     drivers/gpu/drm/drm_gem_dma_helper.c      | 20
>>>>>> ++++++++++++++++----
>>>>>>     drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
>>>>>>     drivers/gpu/drm/rcar-du/Kconfig           |  2 --
>>>>>>     drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
>>>>>>     include/drm/drm_gem_dma_helper.h          |  7 +++++--
>>>>>>     7 files changed, 34 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> index 3b535ad1b07c..93ff05041192 100644
>>>>>> --- a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> @@ -106,16 +106,15 @@ dma_addr_t
>>>>>> drm_fb_dma_get_gem_addr(struct
>>>>>> drm_framebuffer *fb,
>>>>>>     EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
>>>>>>     
>>>>>>     /**
>>>>>> - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-
>>>>>> coherent
>>>>>> backing
>>>>>> - *     memory
>>>>>> + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached
>>>>>> backing
>>>>>> memory
>>>>>>      * @drm: DRM device
>>>>>>      * @old_state: Old plane state
>>>>>>      * @state: New plane state
>>>>>>      *
>>>>>>      * This function can be used by drivers that use damage
>>>>>> clips
>>>>>> and
>>>>>> have
>>>>>> - * DMA GEM objects backed by non-coherent memory. Calling
>>>>>> this
>>>>>> function
>>>>>> - * in a plane's .atomic_update ensures that all the data in
>>>>>> the
>>>>>> backing
>>>>>> - * memory have been written to RAM.
>>>>>> + * DMA GEM objects backed by cached memory. Calling this
>>>>>> function in
>>>>>> a
>>>>>> + * plane's .atomic_update ensures that all the data in the
>>>>>> backing
>>>>>> memory
>>>>>> + * have been written to RAM.
>>>>>>      */
>>>>>>     void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
>>>>>>                                      struct drm_plane_state
>>>>>> *old_state,
>>>>>> @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
>>>>>> drm_device *drm,
>>>>>>     
>>>>>>            for (i = 0; i < finfo->num_planes; i++) {
>>>>>>                    dma_obj = drm_fb_dma_get_gem_obj(state->fb,
>>>>>> i);
>>>>>> -               if (!dma_obj->map_noncoherent)
>>>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>>>>                            continue;
>>>>>>     
>>>>>>                    daddr = drm_fb_dma_get_gem_addr(state->fb,
>>>>>> state, i);
>>>>>> diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> b/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> index d86773fa8ab0..49fe9b284cc8 100644
>>>>>> --- a/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> +++ b/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> @@ -131,7 +131,7 @@ static int
>>>>>> drm_fbdev_dma_helper_fb_probe(struct
>>>>>> drm_fb_helper *fb_helper,
>>>>>>     
>>>>>>            /* screen */
>>>>>>            info->flags |= FBINFO_VIRTFB; /* system memory */
>>>>>> -       if (dma_obj->map_noncoherent)
>>>>>> +       if (dma_obj->cached)
>>>>>>                    info->flags |= FBINFO_READS_FAST; /* signal
>>>>>> caching
>>>>>> */
>>>>>>            info->screen_size = sizes->surface_height * fb-
>>>>>>> pitches[0];
>>>>>>            info->screen_buffer = map.vaddr;
>>>>>> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> index 870b90b78bc4..dec1d512bdf1 100644
>>>>>> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device
>>>>>> *drm,
>>>>>> size_t size, bool private)
>>>>>>                    drm_gem_private_object_init(drm, gem_obj,
>>>>>> size);
>>>>>>     
>>>>>>                    /* Always use writecombine for dma-buf
>>>>>> mappings
>>>>>> */
>>>>>> -               dma_obj->map_noncoherent = false;
>>>>>> +               /* FIXME: This is not always true, on some
>>>>>> dma
>>>>>> coherent system,
>>>>>> +                * cached mappings should be preferred over
>>>>>> writecombine
>>>>>> +                */
>>>>>> +               dma_obj->cached = false;
>>>>>> +               dma_obj->coherent = false;
>>>>>>            } else {
>>>>>>                    ret = drm_gem_object_init(drm, gem_obj,
>>>>>> size);
>>>>>>            }
>>>>>> @@ -143,7 +147,11 @@ struct drm_gem_dma_object
>>>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>>>            if (IS_ERR(dma_obj))
>>>>>>                    return dma_obj;
>>>>>>     
>>>>>> -       if (dma_obj->map_noncoherent) {
>>>>>> +       if (dma_obj->cached && dma_obj->coherent) {
>>>>>> +               dma_obj->vaddr = dma_alloc_coherent(drm->dev,
>>>>>> size,
>>>>>> +                                                   &dma_obj-
>>>>>>> dma_addr,
>>>>>> +
>>>>>> GFP_KERNEL |
>>>>>> __GFP_NOWARN);
>>>>>> +       } else if (dma_obj->cached && !dma_obj->coherent) {
>>>>>>                    dma_obj->vaddr = dma_alloc_noncoherent(drm-
>>>>>>> dev,
>>>>>> size,
>>>>>>                                                          
>>>>>> &dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                          
>>>>>> DMA_TO_DEVICE,
>>>>>> @@ -153,6 +161,7 @@ struct drm_gem_dma_object
>>>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>>>                                                  &dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                  GFP_KERNEL |
>>>>>> __GFP_NOWARN);
>>>>>>            }
>>>>>> +
>>>>>>            if (!dma_obj->vaddr) {
>>>>>>                    drm_dbg(drm, "failed to allocate buffer
>>>>>> with
>>>>>> size
>>>>>> %zu\n",
>>>>>>                             size);
>>>>>> @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct
>>>>>> drm_gem_dma_object
>>>>>> *dma_obj)
>>>>>>                            dma_buf_vunmap_unlocked(gem_obj-
>>>>>>> import_attach->dmabuf, &map);
>>>>>>                    drm_prime_gem_destroy(gem_obj, dma_obj-
>>>>>>> sgt);
>>>>>>            } else if (dma_obj->vaddr) {
>>>>>> -               if (dma_obj->map_noncoherent)
>>>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>>>> +                       dma_free_coherent(gem_obj->dev->dev,
>>>>>> dma_obj-
>>>>>>> base.size,
>>>>>> +                                         dma_obj->vaddr,
>>>>>> dma_obj-
>>>>>>> dma_addr);
>>>>>> +               else if (dma_obj->cached && !dma_obj-
>>>>>>> coherent)
>>>>>>                            dma_free_noncoherent(gem_obj->dev-
>>>>>>> dev,
>>>>>> dma_obj->base.size,
>>>>>>                                                 dma_obj-
>>>>>>> vaddr,
>>>>>> dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                
>>>>>> DMA_TO_DEVICE);
>>>>>> @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct
>>>>>> drm_gem_dma_object
>>>>>> *dma_obj, struct vm_area_struct *
>>>>>>            vma->vm_pgoff -= drm_vma_node_start(&obj-
>>>>>>> vma_node);
>>>>>>            vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
>>>>>>     
>>>>>> -       if (dma_obj->map_noncoherent) {
>>>>>> +       if (dma_obj->cached) {
>>>>>>                    vma->vm_page_prot = vm_get_page_prot(vma-
>>>>>>> vm_flags);
>>>>>>     
>>>>>>                    ret = dma_mmap_pages(dma_obj->base.dev-
>>>>>>> dev,
>>>>>> diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> index 5ec75e9ba499..a3df2f99a757 100644
>>>>>> --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct
>>>>>> drm_device
>>>>>> *drm, size_t size)
>>>>>>            if (!obj)
>>>>>>                    return ERR_PTR(-ENOMEM);
>>>>>>     
>>>>>> -       obj->map_noncoherent = priv->soc_info-
>>>>>>> map_noncoherent;
>>>>>> +       if (priv->soc_info->map_noncoherent) {
>>>>>> +               obj->cached = true;
>>>>>> +               obj->coherent = false;
>>>>>> +       }
>>>>>>     
>>>>>>            return &obj->base;
>>>>>>     }
>>>>>> diff --git a/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> b/drivers/gpu/drm/rcar-
>>>>>> du/Kconfig
>>>>>> index 53c356aed5d5..dddc70c08bdc 100644
>>>>>> --- a/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> +++ b/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> @@ -2,8 +2,6 @@
>>>>>>     config DRM_RCAR_DU
>>>>>>            tristate "DRM Support for R-Car Display Unit"
>>>>>>            depends on DRM && OF
>>>>>> -       depends on ARM || ARM64
>>>>>> -       depends on ARCH_RENESAS || COMPILE_TEST
>>>>>>            select DRM_KMS_HELPER
>>>>>>            select DRM_GEM_DMA_HELPER
>>>>>>            select VIDEOMODE_HELPERS
>>>>>> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> index adfb36b0e815..1142d51473e6 100644
>>>>>> --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> @@ -386,7 +386,9 @@ struct drm_gem_object
>>>>>> *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
>>>>>>            gem_obj->funcs = &rcar_du_gem_funcs;
>>>>>>     
>>>>>>            drm_gem_private_object_init(dev, gem_obj, attach-
>>>>>>> dmabuf-
>>>>>>> size);
>>>>>> -       dma_obj->map_noncoherent = false;
>>>>>> +
>>>>>> +       dma_obj->cached = false;
>>>>>> +       dma_obj->coherent = false;
>>>>>>     
>>>>>>            ret = drm_gem_create_mmap_offset(gem_obj);
>>>>>>            if (ret) {
>>>>>> diff --git a/include/drm/drm_gem_dma_helper.h
>>>>>> b/include/drm/drm_gem_dma_helper.h
>>>>>> index 8a043235dad8..585ce3d4d1eb 100644
>>>>>> --- a/include/drm/drm_gem_dma_helper.h
>>>>>> +++ b/include/drm/drm_gem_dma_helper.h
>>>>>> @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
>>>>>>      *       more than one entry but they are guaranteed to
>>>>>> have
>>>>>> contiguous
>>>>>>      *       DMA addresses.
>>>>>>      * @vaddr: kernel virtual address of the backing memory
>>>>>> - * @map_noncoherent: if true, the GEM object is backed by
>>>>>> non-
>>>>>> coherent memory
>>>>>> + * @cached: if true, the GEM object is backed by cached
>>>>>> memory
>>>>>> + * @coherent: This option only meaningful when a GEM object
>>>>>> is
>>>>>> cached.
>>>>>> + *            If true, Sync the GEM object for DMA access is
>>>>>> not
>>>>>> required.
>>>>>>      */
>>>>>>     struct drm_gem_dma_object {
>>>>>>            struct drm_gem_object base;
>>>>>> @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
>>>>>>            /* For objects with DMA memory allocated by GEM DMA
>>>>>> */
>>>>>>            void *vaddr;
>>>>>>     
>>>>>> -       bool map_noncoherent;
>>>>>> +       bool cached;
>>>>>> +       bool coherent;
>>>>>>     };
>>>>>>     
>>>>>>     #define to_drm_gem_dma_obj(gem_obj) \
  
Sui Jingfeng June 8, 2023, 3:03 a.m. UTC | #8
Hi,

On 2023/6/7 20:19, Maxime Ripard wrote:
> On Wed, Jun 07, 2023 at 06:30:01PM +0800, Sui Jingfeng wrote:
>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>> The single map_noncoherent member of struct drm_gem_dma_object may
>>>> not
>>>> sufficient for describing the backing memory of the GEM buffer
>>>> object.
>>>>
>>>> Especially on dma-coherent systems, the backing memory is both cached
>>>> coherent for multi-core CPUs and dma-coherent for peripheral device.
>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64, etc.
>>>>
>>>> Whether a peripheral device is dma-coherent or not can be
>>>> implementation-dependent. The single map_noncoherent option is not
>>>> enough
>>>> to reflect real hardware anymore. For example, the Loongson LS3A4000
>>>> CPU
>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>> platform
>>>> allways snoop CPU's cache. Doing the allocation with
>>>> dma_alloc_coherent
>>>> function is preferred. The return buffer is cached, it should not
>>>> using
>>>> the default write-combine mapping. While with the current implement,
>>>> there
>>>> no way to tell the drm core to reflect this.
>>>>
>>>> This patch adds cached and coherent members to struct
>>>> drm_gem_dma_object.
>>>> which allow driver implements to inform the core. Introducing new
>>>> mappings
>>>> while keeping the original default behavior unchanged.
>>> Did you try to simply set the "dma-coherent" property to the device's
>>> node?
>> But this approach can only be applied for the device driver with DT support.
>>
>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically do not
>> have DT support.
>>
>> They using ACPI to pass parameter from the firmware to Linux kernel.
>>
>> You approach will lost the effectiveness on such a case.
> Not really, no. All DT support is doing is setting some generic device
> parameter based on that property, but the infrastructure is very much
> generic and can be used on systems without a DT.
>
>>>   From what I understand if you add that property then Linux will use DMA
>>> coherent memory even though you use dma_alloc_noncoherent() and the
>>> sync_single_for_cpu() / sync_single_for_device() are then NOPs.
>   >
>> Please do not mitigate the problems with confusing method.
> It's not a confusing method, it's one of the two main API to deal with
> DMA buffers. And you might disagree with Paul but there's no need to be
> rude about it.
>
>> This approach not only tend to generate confusion but also
>> implement-dependent and arch-dependent. It's definitely problematic.
>>
>>
>> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH specific
>> thing.
>>
>> Dependent on how does the arch_dma_ops is implemented.
>>
>>
>> The definition of the coherent on different ARCH has different meanings.
>>
>> The definition of the wirte-combine on different ARCH has different
>> meanings.
>>
>>
>> The wirte-combine(uncache acceleration) on mips is non dma-coherent.
> Then MIPS breaks the DMA allocation semantics. A buffer allocated with
> dma_alloc_wc is supposed to be coherent.
>
>> But on arm, It seem that wirte-combine is coherent. (guaranteed by arch
>> implement).
>>
>>
>> I also heard using dma_alloc_coherent  to allocation the buffer for the
>> non-coherent doesn't hurt, but the reverse is not true.
>>
>>
>> But please do not create confusion.
>>
>> software composite is faster because better cacheusing rate and
>>
>> cache is faster to read.
>>
>> It is faster because it is cached, not because it is non-coherent.
>>
>> non-coherent is arch thing and/or driver-side thing,
>>
>> it is a side effect of  using the cached mapping.
> Honestly, it's not clear to me what your point or issue is.
>
> Going back to the description in your commit log, you mention that you
> want to support multiple hardware that might or might not be DMA
> coherent, and thus you want to allocate a buffer with different
> attributes depending on that?
>
> Like, you say that the LS3A4000 has a coherency unit and thus doing the
> allocation with dma_alloc_coherent is preferred. Preferred to what? A WC
> buffer? Why?
>
> A WC buffer is a coherent buffer that is allowed to cache writes.
>
> It doesn't have to, and worst case scenario you're inexactly the same
> case than a dma_alloc_coherent buffer.
>
>> It should left to driver to handle such a side effect. The device
>> driver know their device, so its the device driver's responsibility to
>> maintain the coherency.
> Not really, no. Some driver are used across multiple SoCs and multiple
> arch. It doesn't make any sense to encode this in the driver... which is
> why it's in the DT in the first place, and abstracted away by the DMA
> API. Like, do you really expect the amdgpu driver to know the DMA
> attributes it needs to allocate a buffer from when running from a
> RaspberryPi?
>
>> On loongson platform, we don't need to call
>> drm_fb_dma_sync_non_coherent() function, Its already guaranteed by
>> hardware.
> And mostly guaranteed by dma_alloc_coherent. And if you wanted to call
> it anyway, it would be a nop if the device is declared as coherent
> already.
>
> I think you're thinking about this backward. A buffer has mapping
> attributes, and a device has hardware properties.
>
> The driver (ie, software) will allocate a buffer with some mapping
> attributes, and will assume that they are met in the rest of its code.
> How they are met is an implementation detail of the hardware, and for
> all the driver cares, it doesn't have to match.
>
> You can allocate a WC buffer to use on a non-coherent device and that's
> fine. You can allocate a non-coherent buffer on a coherent device and
> that's fine too. The DMA API will make everything work when it needs to,
> and if the hardware already provides stronger guarantees, then it will
> just skip whatever is redundant.
>
> So you need to write your driver using buffer is the most convenient for
> you, and it's really all that matters at the driver level. But for that
> to work, you need to flag the coherence-ness of your devices properly,
> like Paul suggested.

I seems that you are right, at least at ARM world.

Thanks for you tell me this, I might be wrong, this patch may have small 
problems.

I think I should take more time to investigate this problem.

But I need to do more test before I can reply , thank Paul also.

> Maxime
  
Maxime Ripard June 8, 2023, 7:39 a.m. UTC | #9
On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2023/6/8 00:12, Paul Cercueil wrote:
> > Hi Sui,
> > 
> > Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> > > Hi,  welcome to discussion.
> > > 
> > > 
> > > I have limited skills in manipulating English.
> > > 
> > > It may not express what I'm really means in the short time.
> > > 
> > > Part of word in the sentence may not as accurate as your.
> > > 
> > > Well, please don't misunderstand, I'm not doing the rude to you.
> > No problem.
> > 
> > > I will explain it with more details.
> > > 
> > > See below:
> > > 
> > > 
> > > On 2023/6/7 20:09, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > > > Hi,
> > > > > 
> > > > > 
> > > > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > > > Hi Sui,
> > > > > > 
> > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > > > The single map_noncoherent member of struct
> > > > > > > drm_gem_dma_object
> > > > > > > may
> > > > > > > not
> > > > > > > sufficient for describing the backing memory of the GEM
> > > > > > > buffer
> > > > > > > object.
> > > > > > > 
> > > > > > > Especially on dma-coherent systems, the backing memory is
> > > > > > > both
> > > > > > > cached
> > > > > > > coherent for multi-core CPUs and dma-coherent for peripheral
> > > > > > > device.
> > > > > > > Say architectures like X86-64, LoongArch64, Loongson Mips64,
> > > > > > > etc.
> > > > > > > 
> > > > > > > Whether a peripheral device is dma-coherent or not can be
> > > > > > > implementation-dependent. The single map_noncoherent option
> > > > > > > is
> > > > > > > not
> > > > > > > enough
> > > > > > > to reflect real hardware anymore. For example, the Loongson
> > > > > > > LS3A4000
> > > > > > > CPU
> > > > > > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > > > > > platform
> > > > > > > allways snoop CPU's cache. Doing the allocation with
> > > > > > > dma_alloc_coherent
> > > > > > > function is preferred. The return buffer is cached, it should
> > > > > > > not
> > > > > > > using
> > > > > > > the default write-combine mapping. While with the current
> > > > > > > implement,
> > > > > > > there
> > > > > > > no way to tell the drm core to reflect this.
> > > > > > > 
> > > > > > > This patch adds cached and coherent members to struct
> > > > > > > drm_gem_dma_object.
> > > > > > > which allow driver implements to inform the core. Introducing
> > > > > > > new
> > > > > > > mappings
> > > > > > > while keeping the original default behavior unchanged.
> > > > > > Did you try to simply set the "dma-coherent" property to the
> > > > > > device's
> > > > > > node?
> > > > > But this approach can only be applied for the device driver with
> > > > > DT
> > > > > support.
> > > > > 
> > > > > X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
> > > > > do
> > > > > not
> > > > > have DT support.
> > > > > 
> > > > > They using ACPI to pass parameter from the firmware to Linux
> > > > > kernel.
> > > > > 
> > > > > You approach will lost the effectiveness on such a case.
> > > > Well, I don't really know how ACPI handles it - but it should just
> > > > be a
> > > > matter of setting dev->dma_coherent. That's basically what the DT
> > > > code
> > > > does.
> > > > 
> > > > Some MIPS boards set it in their setup code for instance.
> > > > 
> > > This is a *strategy*, not a *mechanism*.
> > > 
> > > In this case, DT is just used to describing the hardware.
> > > 
> > > (It is actually a hardware feature describing language, the
> > > granularity
> > > is large)
> > > 
> > > It does not changing the state of the hardware.
> > > 
> > > It's your platform firmware or kernel setting up code who actually do
> > > such a things.
> > > 
> > > 
> > > It's just that it works on *one* platform, it does not guarantee it
> > > will
> > > works on others.
> > If you add the "dma-coherent" property in a device node in DT, you
> > effectively specify that the device is DMA-coherent; so you describe
> > the hardware, which is what DT is for, and you are not changing the
> > state of the hardware.
> > 
> > Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> > default to DMA-coherent mapping; I believe you could do something
> > similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
> > 
> The preblem is that device driver can have various demand.
> 
> It probably want to create different kind of buffers for different thing
> simultaneously.
> 
> Say, one allocated with dma_alloc_coherent for command buffer or dma
> descriptor
> 
> another one allocated with  dma_alloc_wc for uploading shader etc.
> 
> also has the third one allocated with dma_alloc_noncoherent() for doing some
> else.

And it will work just fine.

struct device dma_coherent, or DT's dma-coherent property define that
the device doesn't need any kind of cache maintenance, ever. If it's
missing, we need to perform cache maintenance to keep coherency.

dma_alloc_* functions provide guarantees to the driver. With
dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
you don't need to perform cache maintenance operations by hand in the
driver.

With dma_alloc_noncoherent, the buffer is non-coherent and the driver
needs to perform them when relevant.

How those buffers are created is platform specific, but the guarantees
provided *to the driver* are always there.

A buffer allocated with dma_alloc_coherent might be provided by
different means (at the hardware level with a coherency unit, by mapping
it non-cacheable), but as far as the driver is concerned it's always
going to be coherent.

Similarly, a driver using dma_alloc_noncoherent will always require
cache maintenance operations to use the API properly, even if the
hardware provides coherency (in which case, those operations will be
nop).

So, yeah, like I was saying in the other mail, it looks like you're
confusing a bunch of things. dma_alloc_* functions are about the driver
expectations and guarantees. DT's dma-coherent property is about how we
can implement them on a given platform.

They don't have to match, and that's precisely how we can have drivers
that run on any combination of platforms: the driver only cares about
the buffer guarantees, the platform description takes care of how they
are implemented.

Maxime
  
Sui Jingfeng June 8, 2023, 8:17 a.m. UTC | #10
Hi,

On 2023/6/8 15:39, Maxime Ripard wrote:
> On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
>> Hi,
>>
>> On 2023/6/8 00:12, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
>>>> Hi,  welcome to discussion.
>>>>
>>>>
>>>> I have limited skills in manipulating English.
>>>>
>>>> It may not express what I'm really means in the short time.
>>>>
>>>> Part of word in the sentence may not as accurate as your.
>>>>
>>>> Well, please don't misunderstand, I'm not doing the rude to you.
>>> No problem.
>>>
>>>> I will explain it with more details.
>>>>
>>>> See below:
>>>>
>>>>
>>>> On 2023/6/7 20:09, Paul Cercueil wrote:
>>>>> Hi Sui,
>>>>>
>>>>> Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>>>>>> Hi Sui,
>>>>>>>
>>>>>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>>>>>> The single map_noncoherent member of struct
>>>>>>>> drm_gem_dma_object
>>>>>>>> may
>>>>>>>> not
>>>>>>>> sufficient for describing the backing memory of the GEM
>>>>>>>> buffer
>>>>>>>> object.
>>>>>>>>
>>>>>>>> Especially on dma-coherent systems, the backing memory is
>>>>>>>> both
>>>>>>>> cached
>>>>>>>> coherent for multi-core CPUs and dma-coherent for peripheral
>>>>>>>> device.
>>>>>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64,
>>>>>>>> etc.
>>>>>>>>
>>>>>>>> Whether a peripheral device is dma-coherent or not can be
>>>>>>>> implementation-dependent. The single map_noncoherent option
>>>>>>>> is
>>>>>>>> not
>>>>>>>> enough
>>>>>>>> to reflect real hardware anymore. For example, the Loongson
>>>>>>>> LS3A4000
>>>>>>>> CPU
>>>>>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>>>>>> platform
>>>>>>>> allways snoop CPU's cache. Doing the allocation with
>>>>>>>> dma_alloc_coherent
>>>>>>>> function is preferred. The return buffer is cached, it should
>>>>>>>> not
>>>>>>>> using
>>>>>>>> the default write-combine mapping. While with the current
>>>>>>>> implement,
>>>>>>>> there
>>>>>>>> no way to tell the drm core to reflect this.
>>>>>>>>
>>>>>>>> This patch adds cached and coherent members to struct
>>>>>>>> drm_gem_dma_object.
>>>>>>>> which allow driver implements to inform the core. Introducing
>>>>>>>> new
>>>>>>>> mappings
>>>>>>>> while keeping the original default behavior unchanged.
>>>>>>> Did you try to simply set the "dma-coherent" property to the
>>>>>>> device's
>>>>>>> node?
>>>>>> But this approach can only be applied for the device driver with
>>>>>> DT
>>>>>> support.
>>>>>>
>>>>>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
>>>>>> do
>>>>>> not
>>>>>> have DT support.
>>>>>>
>>>>>> They using ACPI to pass parameter from the firmware to Linux
>>>>>> kernel.
>>>>>>
>>>>>> You approach will lost the effectiveness on such a case.
>>>>> Well, I don't really know how ACPI handles it - but it should just
>>>>> be a
>>>>> matter of setting dev->dma_coherent. That's basically what the DT
>>>>> code
>>>>> does.
>>>>>
>>>>> Some MIPS boards set it in their setup code for instance.
>>>>>
>>>> This is a *strategy*, not a *mechanism*.
>>>>
>>>> In this case, DT is just used to describing the hardware.
>>>>
>>>> (It is actually a hardware feature describing language, the
>>>> granularity
>>>> is large)
>>>>
>>>> It does not changing the state of the hardware.
>>>>
>>>> It's your platform firmware or kernel setting up code who actually do
>>>> such a things.
>>>>
>>>>
>>>> It's just that it works on *one* platform, it does not guarantee it
>>>> will
>>>> works on others.
>>> If you add the "dma-coherent" property in a device node in DT, you
>>> effectively specify that the device is DMA-coherent; so you describe
>>> the hardware, which is what DT is for, and you are not changing the
>>> state of the hardware.
>>>
>>> Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
>>> default to DMA-coherent mapping; I believe you could do something
>>> similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
>>>
>> The preblem is that device driver can have various demand.
>>
>> It probably want to create different kind of buffers for different thing
>> simultaneously.
>>
>> Say, one allocated with dma_alloc_coherent for command buffer or dma
>> descriptor
>>
>> another one allocated with  dma_alloc_wc for uploading shader etc.
>>
>> also has the third one allocated with dma_alloc_noncoherent() for doing some
>> else.
> And it will work just fine.
>
> struct device dma_coherent, or DT's dma-coherent property define that
> the device doesn't need any kind of cache maintenance, ever. If it's
> missing, we need to perform cache maintenance to keep coherency.
>
> dma_alloc_* functions provide guarantees to the driver. With
> dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
> you don't need to perform cache maintenance operations by hand in the
> driver.

BO returned by dma_alloc_wc() doesn't works on some platform.

This may only guarantee for the CPU side. There is no guarantee for the 
GPU side.

For example, the GPU always snoop CPU's cache. The GPU fetch data from 
the CPU's cache if hit.

if not hit, the GPU fetch the data from the system RAM.


But when call dma_alloc_wc(), the BO at cpu side is marked as write 
combine property.

The write buffer within the CPU will gather the CPU side write access.

This is to say, there may have some data reside(stall) in the write buffer.

while the GPU will fetch data from the system RAM or CPU's cache.

the GPU will fetch wrong data.


This is the condition for our hardware, I don't know how does the ARM 
platform guarantee

the coherency in this case.


If it relay on software to guarantee, then it is still non hardware 
maintained coherency.


When it relay on software, I called it implement-dependent.

there are some archs without the implement or don't know how to implement.


If it can't even snoop cpu's cache, I don't believe it can snoop cpu's 
write buffer.

I not sure dma api can do guarantee for all arch.


> With dma_alloc_noncoherent, the buffer is non-coherent and the driver
> needs to perform them when relevant.
>
> How those buffers are created is platform specific, but the guarantees
> provided *to the driver* are always there.
>
> A buffer allocated with dma_alloc_coherent might be provided by
> different means (at the hardware level with a coherency unit, by mapping
> it non-cacheable), but as far as the driver is concerned it's always
> going to be coherent.
>
> Similarly, a driver using dma_alloc_noncoherent will always require
> cache maintenance operations to use the API properly, even if the
> hardware provides coherency (in which case, those operations will be
> nop).
>
> So, yeah, like I was saying in the other mail, it looks like you're
> confusing a bunch of things. dma_alloc_* functions are about the driver
> expectations and guarantees. DT's dma-coherent property is about how we
> can implement them on a given platform.

That is ideal situation.

You don't have seen the actual bugs.

Yeah, I do have a bit confusing about the DMA api.

Maybe you and Paul can continue work on this.


But DT's dma-coherent property is definitely not a system level solution.

drm/amdgpu, drm/radeon and drm/i915 don't support DT.

If ARM is dma-noncoherent, I suspect drm/amdgpu, drm/radeon will not works on ARM.

there no function call dma_sync_for_device() dma_sync_for_cpu() etc

These driver assume dma-coherent hardware.

> They don't have to match, and that's precisely how we can have drivers
> that run on any combination of platforms: the driver only cares about
> the buffer guarantees, the platform description takes care of how they
> are implemented.
>
> Maxime
  
Maxime Ripard June 12, 2023, 3:15 p.m. UTC | #11
On Thu, Jun 08, 2023 at 04:17:52PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2023/6/8 15:39, Maxime Ripard wrote:
> > On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
> > > Hi,
> > > 
> > > On 2023/6/8 00:12, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> > > > > Hi,  welcome to discussion.
> > > > > 
> > > > > 
> > > > > I have limited skills in manipulating English.
> > > > > 
> > > > > It may not express what I'm really means in the short time.
> > > > > 
> > > > > Part of word in the sentence may not as accurate as your.
> > > > > 
> > > > > Well, please don't misunderstand, I'm not doing the rude to you.
> > > > No problem.
> > > > 
> > > > > I will explain it with more details.
> > > > > 
> > > > > See below:
> > > > > 
> > > > > 
> > > > > On 2023/6/7 20:09, Paul Cercueil wrote:
> > > > > > Hi Sui,
> > > > > > 
> > > > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > 
> > > > > > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > > > > > Hi Sui,
> > > > > > > > 
> > > > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > > > > > The single map_noncoherent member of struct
> > > > > > > > > drm_gem_dma_object
> > > > > > > > > may
> > > > > > > > > not
> > > > > > > > > sufficient for describing the backing memory of the GEM
> > > > > > > > > buffer
> > > > > > > > > object.
> > > > > > > > > 
> > > > > > > > > Especially on dma-coherent systems, the backing memory is
> > > > > > > > > both
> > > > > > > > > cached
> > > > > > > > > coherent for multi-core CPUs and dma-coherent for peripheral
> > > > > > > > > device.
> > > > > > > > > Say architectures like X86-64, LoongArch64, Loongson Mips64,
> > > > > > > > > etc.
> > > > > > > > > 
> > > > > > > > > Whether a peripheral device is dma-coherent or not can be
> > > > > > > > > implementation-dependent. The single map_noncoherent option
> > > > > > > > > is
> > > > > > > > > not
> > > > > > > > > enough
> > > > > > > > > to reflect real hardware anymore. For example, the Loongson
> > > > > > > > > LS3A4000
> > > > > > > > > CPU
> > > > > > > > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > > > > > > > platform
> > > > > > > > > allways snoop CPU's cache. Doing the allocation with
> > > > > > > > > dma_alloc_coherent
> > > > > > > > > function is preferred. The return buffer is cached, it should
> > > > > > > > > not
> > > > > > > > > using
> > > > > > > > > the default write-combine mapping. While with the current
> > > > > > > > > implement,
> > > > > > > > > there
> > > > > > > > > no way to tell the drm core to reflect this.
> > > > > > > > > 
> > > > > > > > > This patch adds cached and coherent members to struct
> > > > > > > > > drm_gem_dma_object.
> > > > > > > > > which allow driver implements to inform the core. Introducing
> > > > > > > > > new
> > > > > > > > > mappings
> > > > > > > > > while keeping the original default behavior unchanged.
> > > > > > > > Did you try to simply set the "dma-coherent" property to the
> > > > > > > > device's
> > > > > > > > node?
> > > > > > > But this approach can only be applied for the device driver with
> > > > > > > DT
> > > > > > > support.
> > > > > > > 
> > > > > > > X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
> > > > > > > do
> > > > > > > not
> > > > > > > have DT support.
> > > > > > > 
> > > > > > > They using ACPI to pass parameter from the firmware to Linux
> > > > > > > kernel.
> > > > > > > 
> > > > > > > You approach will lost the effectiveness on such a case.
> > > > > > Well, I don't really know how ACPI handles it - but it should just
> > > > > > be a
> > > > > > matter of setting dev->dma_coherent. That's basically what the DT
> > > > > > code
> > > > > > does.
> > > > > > 
> > > > > > Some MIPS boards set it in their setup code for instance.
> > > > > > 
> > > > > This is a *strategy*, not a *mechanism*.
> > > > > 
> > > > > In this case, DT is just used to describing the hardware.
> > > > > 
> > > > > (It is actually a hardware feature describing language, the
> > > > > granularity
> > > > > is large)
> > > > > 
> > > > > It does not changing the state of the hardware.
> > > > > 
> > > > > It's your platform firmware or kernel setting up code who actually do
> > > > > such a things.
> > > > > 
> > > > > 
> > > > > It's just that it works on *one* platform, it does not guarantee it
> > > > > will
> > > > > works on others.
> > > > If you add the "dma-coherent" property in a device node in DT, you
> > > > effectively specify that the device is DMA-coherent; so you describe
> > > > the hardware, which is what DT is for, and you are not changing the
> > > > state of the hardware.
> > > > 
> > > > Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> > > > default to DMA-coherent mapping; I believe you could do something
> > > > similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
> > > > 
> > > The preblem is that device driver can have various demand.
> > > 
> > > It probably want to create different kind of buffers for different thing
> > > simultaneously.
> > > 
> > > Say, one allocated with dma_alloc_coherent for command buffer or dma
> > > descriptor
> > > 
> > > another one allocated with  dma_alloc_wc for uploading shader etc.
> > > 
> > > also has the third one allocated with dma_alloc_noncoherent() for doing some
> > > else.
> > And it will work just fine.
> > 
> > struct device dma_coherent, or DT's dma-coherent property define that
> > the device doesn't need any kind of cache maintenance, ever. If it's
> > missing, we need to perform cache maintenance to keep coherency.
> > 
> > dma_alloc_* functions provide guarantees to the driver. With
> > dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
> > you don't need to perform cache maintenance operations by hand in the
> > driver.
> 
> BO returned by dma_alloc_wc() doesn't works on some platform.

What do you mean by "it doesn't work" ? Because then, it's definitely
the platform that's broken if it doesn't provide a coherent buffer.

> This may only guarantee for the CPU side. There is no guarantee for
> the GPU side.
> 
> For example, the GPU always snoop CPU's cache. The GPU fetch data from
> the CPU's cache if hit.
> 
> if not hit, the GPU fetch the data from the system RAM.

Right, that's still a coherent buffer.

> But when call dma_alloc_wc(), the BO at cpu side is marked as write
> combine property.
> 
> The write buffer within the CPU will gather the CPU side write access.
> 
> This is to say, there may have some data reside(stall) in the write buffer.
> 
> while the GPU will fetch data from the system RAM or CPU's cache.
> 
> the GPU will fetch wrong data.

If that's the case, your buffer isn't coherent, and your platform
probably breaks the expectations of the DMA API.

> This is the condition for our hardware, I don't know how does the ARM
> platform guarantee
> 
> the coherency in this case.
>
> If it relay on software to guarantee, then it is still non hardware
> maintained coherency.
> 
> 
> When it relay on software, I called it implement-dependent.
> 
> there are some archs without the implement or don't know how to implement.
> 
> 
> If it can't even snoop cpu's cache, I don't believe it can snoop cpu's write
> buffer.
> 
> I not sure dma api can do guarantee for all arch.
> 
> 
> > With dma_alloc_noncoherent, the buffer is non-coherent and the driver
> > needs to perform them when relevant.
> > 
> > How those buffers are created is platform specific, but the guarantees
> > provided *to the driver* are always there.
> > 
> > A buffer allocated with dma_alloc_coherent might be provided by
> > different means (at the hardware level with a coherency unit, by mapping
> > it non-cacheable), but as far as the driver is concerned it's always
> > going to be coherent.
> > 
> > Similarly, a driver using dma_alloc_noncoherent will always require
> > cache maintenance operations to use the API properly, even if the
> > hardware provides coherency (in which case, those operations will be
> > nop).
> > 
> > So, yeah, like I was saying in the other mail, it looks like you're
> > confusing a bunch of things. dma_alloc_* functions are about the driver
> > expectations and guarantees. DT's dma-coherent property is about how we
> > can implement them on a given platform.
> 
> That is ideal situation.
> 
> You don't have seen the actual bugs.

Probably not :)

> Yeah, I do have a bit confusing about the DMA api.
> 
> Maybe you and Paul can continue work on this.

As far as I'm concerned, there's nothing to fix but your platform
support. I won't work on that.

> But DT's dma-coherent property is definitely not a system level
> solution.
> 
> drm/amdgpu, drm/radeon and drm/i915 don't support DT.
> 
> If ARM is dma-noncoherent, I suspect drm/amdgpu, drm/radeon will not
> works on ARM.
> 
> there no function call dma_sync_for_device() dma_sync_for_cpu() etc
> 
> These driver assume dma-coherent hardware.

No, these drivers assume they have a dma-coherent *buffer*. Which on
devices that don't get hardware coherency mean that they probably get a
non-cacheable buffer.

Maxime
  
Sui Jingfeng June 13, 2023, 12:27 p.m. UTC | #12
Hi,

On 2023/6/8 00:12, Paul Cercueil wrote:
> Hi Sui,
>
> Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
>> Hi,  welcome to discussion.
>>
>>
>> I have limited skills in manipulating English.
>>
>> It may not express what I'm really means in the short time.
>>
>> Part of word in the sentence may not as accurate as your.
>>
>> Well, please don't misunderstand, I'm not doing the rude to you.
> No problem.
>
>> I will explain it with more details.
>>
>> See below:
>>
>>
>> On 2023/6/7 20:09, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
>>>> Hi,
>>>>
>>>>
>>>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>>>> Hi Sui,
>>>>>
>>>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>>>> The single map_noncoherent member of struct
>>>>>> drm_gem_dma_object
>>>>>> may
>>>>>> not
>>>>>> sufficient for describing the backing memory of the GEM
>>>>>> buffer
>>>>>> object.
>>>>>>
>>>>>> Especially on dma-coherent systems, the backing memory is
>>>>>> both
>>>>>> cached
>>>>>> coherent for multi-core CPUs and dma-coherent for peripheral
>>>>>> device.
>>>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64,
>>>>>> etc.
>>>>>>
>>>>>> Whether a peripheral device is dma-coherent or not can be
>>>>>> implementation-dependent. The single map_noncoherent option
>>>>>> is
>>>>>> not
>>>>>> enough
>>>>>> to reflect real hardware anymore. For example, the Loongson
>>>>>> LS3A4000
>>>>>> CPU
>>>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>>>> platform
>>>>>> allways snoop CPU's cache. Doing the allocation with
>>>>>> dma_alloc_coherent
>>>>>> function is preferred. The return buffer is cached, it should
>>>>>> not
>>>>>> using
>>>>>> the default write-combine mapping. While with the current
>>>>>> implement,
>>>>>> there
>>>>>> no way to tell the drm core to reflect this.
>>>>>>
>>>>>> This patch adds cached and coherent members to struct
>>>>>> drm_gem_dma_object.
>>>>>> which allow driver implements to inform the core. Introducing
>>>>>> new
>>>>>> mappings
>>>>>> while keeping the original default behavior unchanged.
>>>>> Did you try to simply set the "dma-coherent" property to the
>>>>> device's
>>>>> node?
>>>> But this approach can only be applied for the device driver with
>>>> DT
>>>> support.
>>>>
>>>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
>>>> do
>>>> not
>>>> have DT support.
>>>>
>>>> They using ACPI to pass parameter from the firmware to Linux
>>>> kernel.
>>>>
>>>> You approach will lost the effectiveness on such a case.
>>> Well, I don't really know how ACPI handles it - but it should just
>>> be a
>>> matter of setting dev->dma_coherent. That's basically what the DT
>>> code
>>> does.
>>>
>>> Some MIPS boards set it in their setup code for instance.
>>>
>> This is a *strategy*, not a *mechanism*.
>>
>> In this case, DT is just used to describing the hardware.
>>
>> (It is actually a hardware feature describing language, the
>> granularity
>> is large)
>>
>> It does not changing the state of the hardware.
>>
>> It's your platform firmware or kernel setting up code who actually do
>> such a things.
>>
>>
>> It's just that it works on *one* platform, it does not guarantee it
>> will
>> works on others.
> If you add the "dma-coherent" property in a device node in DT, you
> effectively specify that the device is DMA-coherent; so you describe
> the hardware, which is what DT is for, and you are not changing the
> state of the hardware.
>
> Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> default to DMA-coherent mapping; I believe you could do something
> similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
>
>
>> While my patch is trying to create a *mechanism* which could probably
>>
>> works on all platform.
>>
>>
>> It is based the patch you have already commuted.
>>
>> Thanks for your excellent contribution.
>>
>>
>>>>>    From what I understand if you add that property then Linux
>>>>> will
>>>>> use DMA
>>>>> coherent memory even though you use dma_alloc_noncoherent() and
>>>>> the
>>>>> sync_single_for_cpu() / sync_single_for_device() are then NOPs.
>>>> Please do not mitigate the problems with confusing method.
>>>>
>>>>
>>>> This approach not only tend to generate confusion but also
>>>> implement-dependent
>>>>
>>>> and arch-dependent. It's definitely problematic.
>>>>
>>>>
>>>> How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH
>>>> specific
>>>> thing.
>>>>
>>>> Dependent on how does the arch_dma_ops is implemented.
>>>>
>>>>
>>>> The definition of the coherent on different ARCH has different
>>>> meanings.
>>>>
>>>> The definition of the wirte-combine on different ARCH has
>>>> different
>>>> meanings.
>>>>
>>>>
>>>> The wirte-combine(uncache acceleration) on mips is non dma-
>>>> coherent.
>>> It is dma-coherent on Ingenic SoCs.
>>>
>>>
>> It is dma-coherent ? How does it achieve it?
>>
>>
>> As far as I know,  there is a write buffer within the mips cpu.
>>
>> typically 64 byte,  but it is not cache. It will gather the CPU write
>> access,
>>
>> When a peripheral device do the DMA, how does you platform guarantee
>>
>> the data in the CPU write buffer has been already arrived at (or
>> flushed
>> out to)
>>
>> the system RAM?
>>
>>
>> Does the  peripheral device snoop the CPU's write buffer,
>>
>> or it need manually flush the write buffer with SYNC instruction?
> I believe the DMA flushes the write buffer? I don't actually know the
> details, it would be something to ask to Ingenic.
>
>>>> But on arm, It seem that wirte-combine is coherent. (guaranteed
>>>> by
>>>> arch
>>>> implement).
>>>>
>>>>
>>>> I also heard using dma_alloc_coherent  to allocation the buffer
>>>> for
>>>> the
>>>> non-coherent doesn't hurt, but the reverse is not true.
>>>>
>>>>
>>>> But please do not create confusion.
>>>>
>>>> software composite is faster because better cacheusing rate and
>>>>
>>>> cache is faster to read.
>>>>
>>>> It is faster because it is cached, not because it is non-
>>>> coherent.
>>>>
>>>> non-coherent is arch thing and/or driver-side thing,
>>>>
>>>> it is a side effect of  using the cached mapping.
>>> Yes, I know that.
>>>
>>>> It should left to driver to handle such a side effect. The device
>>>> driver
>>>>
>>>> know their device, so its the device driver's responsibility to
>>>> maintain
>>>> the coherency.  On loongson platform, we don't need to call
>>>> drm_fb_dma_sync_non_coherent() function, Its already guaranteed
>>>> by
>>>> hardware.
>>> I understand. What I'm saying, is that you should be able to set
>>> dma_obj->map_noncoherent (which would arguably be better named
>>> "map_cached",
>> My point is that the word *cached* reflect the nature,
>>
>> dma-coherent or dma-noncoherent is secondary.
>>
>> We are all on the way to pursue the performance.
>>
>> In the end, it is the cache give us the speed.
>>
>>
>> Why not we credit the cache hardware inside of the CPU?
> dma_alloc_noncoherent() gives you *cached* memory.
>
> Therefore, if you want *cached* memory, you should set
> gem->map_noncoherent.
>
> I understand your confusion; it would be easier to understand if this
> function was called dma_alloc_cached().
>
> Then, if the memory is actually DMA-coherent for the device (dev-
>> dma_coherent == true), the drm_fb_dma_sync_non_coherent() function is
> a no-op.
>
> But in both cases (DMA-coherent device, non DMA-coherent device), if
> you want cached buffers, you should call dma_alloc_noncoherent().
>
>
>>> but that's a different problem). Then the GEM code would
>>> end up calling dma_alloc_noncoherent(), which will give you
>>> *cached*
>>> memory. Then as long as dev->dma_coherent = true,
>>> drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't
>>> pointlessly sync/invalidate the caches.
>>>
>>> And I disagree with you, the driver shouldn't handle such things.
>> You already handle the side effect of such things, See below:
>>
>>
>> ```
>>
>>      if (ingenic_drm_map_noncoherent(ipu->master))
>>           drm_fb_dma_sync_non_coherent(ipu->drm, oldstate, newstate);
>>
>> ```
>>
>> By the way,  Ingenic is the only driver in the drivers/gpu/drm/ that
>> handle such things
>>
>> so far.
> Yes; and now I think that this was a bad idea (for the reasons Maxime
> listed in his email).
>
>>>    The
>>> fact that it is better to use cached memory or uncached with write-
>>> combine really is platform-specific and not something that the
>>> driver
>>> should be aware of.
>> But the fact is that,  It is drm/ingenic tell the drm core,  some SoC
>> is
>> prefer cached,
>>
>> but unable to enforce the coherent. So that it need  flush the cache
>> manually.
>>
>> What do you meant by saying that the driver should not be aware of ?
> Ideally, the driver should just call a function "dma_alloc_buffer",
> which would return cached memory when it makes sense, otherwise a
> uncached buffer with the write-combine attribute.
>
> Then the arch code (or DT) can decide what's the best setting, and not
> the driver.
>
> In the meantime, you should use gem->dma_noncoherent like the ingenic-
> drm driver does - until somebody (probably me) refactor things.

ingenic-drm using fbdev-generic, while fbdev-generic allocate a shadow 
buffer in the system

RAM, it will copy the damaged area from the system RAM to the real 
framebuffer at a regular

interval. But fbdev-generic does not flush the cached when doing the 
dirty update.


Now that ingenic display controller is not hardware maintained cache 
coherency,

in this case, I think there will some framebuffer data stall in the 
hardware cache,

there will be artificial visible because lost of consistency when switch 
to the

virtual terminal(using fbcon).  Am I right ?


Paul,  can you confirm that?

> Cheers,
> -Paul
>
>>> Cheers,
>>> -Paul
>>>
>>>>> Cheers,
>>>>> -Paul
>>>>>
>>>>>> Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn>
>>>>>> ---
>>>>>>     drivers/gpu/drm/drm_fb_dma_helper.c       | 11 +++++------
>>>>>>     drivers/gpu/drm/drm_fbdev_dma.c           |  2 +-
>>>>>>     drivers/gpu/drm/drm_gem_dma_helper.c      | 20
>>>>>> ++++++++++++++++----
>>>>>>     drivers/gpu/drm/ingenic/ingenic-drm-drv.c |  5 ++++-
>>>>>>     drivers/gpu/drm/rcar-du/Kconfig           |  2 --
>>>>>>     drivers/gpu/drm/rcar-du/rcar_du_kms.c     |  4 +++-
>>>>>>     include/drm/drm_gem_dma_helper.h          |  7 +++++--
>>>>>>     7 files changed, 34 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> index 3b535ad1b07c..93ff05041192 100644
>>>>>> --- a/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
>>>>>> @@ -106,16 +106,15 @@ dma_addr_t
>>>>>> drm_fb_dma_get_gem_addr(struct
>>>>>> drm_framebuffer *fb,
>>>>>>     EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
>>>>>>     
>>>>>>     /**
>>>>>> - * drm_fb_dma_sync_non_coherent - Sync GEM object to non-
>>>>>> coherent
>>>>>> backing
>>>>>> - *     memory
>>>>>> + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached
>>>>>> backing
>>>>>> memory
>>>>>>      * @drm: DRM device
>>>>>>      * @old_state: Old plane state
>>>>>>      * @state: New plane state
>>>>>>      *
>>>>>>      * This function can be used by drivers that use damage
>>>>>> clips
>>>>>> and
>>>>>> have
>>>>>> - * DMA GEM objects backed by non-coherent memory. Calling
>>>>>> this
>>>>>> function
>>>>>> - * in a plane's .atomic_update ensures that all the data in
>>>>>> the
>>>>>> backing
>>>>>> - * memory have been written to RAM.
>>>>>> + * DMA GEM objects backed by cached memory. Calling this
>>>>>> function in
>>>>>> a
>>>>>> + * plane's .atomic_update ensures that all the data in the
>>>>>> backing
>>>>>> memory
>>>>>> + * have been written to RAM.
>>>>>>      */
>>>>>>     void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
>>>>>>                                      struct drm_plane_state
>>>>>> *old_state,
>>>>>> @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct
>>>>>> drm_device *drm,
>>>>>>     
>>>>>>            for (i = 0; i < finfo->num_planes; i++) {
>>>>>>                    dma_obj = drm_fb_dma_get_gem_obj(state->fb,
>>>>>> i);
>>>>>> -               if (!dma_obj->map_noncoherent)
>>>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>>>>                            continue;
>>>>>>     
>>>>>>                    daddr = drm_fb_dma_get_gem_addr(state->fb,
>>>>>> state, i);
>>>>>> diff --git a/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> b/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> index d86773fa8ab0..49fe9b284cc8 100644
>>>>>> --- a/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> +++ b/drivers/gpu/drm/drm_fbdev_dma.c
>>>>>> @@ -131,7 +131,7 @@ static int
>>>>>> drm_fbdev_dma_helper_fb_probe(struct
>>>>>> drm_fb_helper *fb_helper,
>>>>>>     
>>>>>>            /* screen */
>>>>>>            info->flags |= FBINFO_VIRTFB; /* system memory */
>>>>>> -       if (dma_obj->map_noncoherent)
>>>>>> +       if (dma_obj->cached)
>>>>>>                    info->flags |= FBINFO_READS_FAST; /* signal
>>>>>> caching
>>>>>> */
>>>>>>            info->screen_size = sizes->surface_height * fb-
>>>>>>> pitches[0];
>>>>>>            info->screen_buffer = map.vaddr;
>>>>>> diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> index 870b90b78bc4..dec1d512bdf1 100644
>>>>>> --- a/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> +++ b/drivers/gpu/drm/drm_gem_dma_helper.c
>>>>>> @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device
>>>>>> *drm,
>>>>>> size_t size, bool private)
>>>>>>                    drm_gem_private_object_init(drm, gem_obj,
>>>>>> size);
>>>>>>     
>>>>>>                    /* Always use writecombine for dma-buf
>>>>>> mappings
>>>>>> */
>>>>>> -               dma_obj->map_noncoherent = false;
>>>>>> +               /* FIXME: This is not always true, on some
>>>>>> dma
>>>>>> coherent system,
>>>>>> +                * cached mappings should be preferred over
>>>>>> writecombine
>>>>>> +                */
>>>>>> +               dma_obj->cached = false;
>>>>>> +               dma_obj->coherent = false;
>>>>>>            } else {
>>>>>>                    ret = drm_gem_object_init(drm, gem_obj,
>>>>>> size);
>>>>>>            }
>>>>>> @@ -143,7 +147,11 @@ struct drm_gem_dma_object
>>>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>>>            if (IS_ERR(dma_obj))
>>>>>>                    return dma_obj;
>>>>>>     
>>>>>> -       if (dma_obj->map_noncoherent) {
>>>>>> +       if (dma_obj->cached && dma_obj->coherent) {
>>>>>> +               dma_obj->vaddr = dma_alloc_coherent(drm->dev,
>>>>>> size,
>>>>>> +                                                   &dma_obj-
>>>>>>> dma_addr,
>>>>>> +
>>>>>> GFP_KERNEL |
>>>>>> __GFP_NOWARN);
>>>>>> +       } else if (dma_obj->cached && !dma_obj->coherent) {
>>>>>>                    dma_obj->vaddr = dma_alloc_noncoherent(drm-
>>>>>>> dev,
>>>>>> size,
>>>>>>                                                          
>>>>>> &dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                          
>>>>>> DMA_TO_DEVICE,
>>>>>> @@ -153,6 +161,7 @@ struct drm_gem_dma_object
>>>>>> *drm_gem_dma_create(struct drm_device *drm,
>>>>>>                                                  &dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                  GFP_KERNEL |
>>>>>> __GFP_NOWARN);
>>>>>>            }
>>>>>> +
>>>>>>            if (!dma_obj->vaddr) {
>>>>>>                    drm_dbg(drm, "failed to allocate buffer
>>>>>> with
>>>>>> size
>>>>>> %zu\n",
>>>>>>                             size);
>>>>>> @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct
>>>>>> drm_gem_dma_object
>>>>>> *dma_obj)
>>>>>>                            dma_buf_vunmap_unlocked(gem_obj-
>>>>>>> import_attach->dmabuf, &map);
>>>>>>                    drm_prime_gem_destroy(gem_obj, dma_obj-
>>>>>>> sgt);
>>>>>>            } else if (dma_obj->vaddr) {
>>>>>> -               if (dma_obj->map_noncoherent)
>>>>>> +               if (dma_obj->cached && dma_obj->coherent)
>>>>>> +                       dma_free_coherent(gem_obj->dev->dev,
>>>>>> dma_obj-
>>>>>>> base.size,
>>>>>> +                                         dma_obj->vaddr,
>>>>>> dma_obj-
>>>>>>> dma_addr);
>>>>>> +               else if (dma_obj->cached && !dma_obj-
>>>>>>> coherent)
>>>>>>                            dma_free_noncoherent(gem_obj->dev-
>>>>>>> dev,
>>>>>> dma_obj->base.size,
>>>>>>                                                 dma_obj-
>>>>>>> vaddr,
>>>>>> dma_obj-
>>>>>>> dma_addr,
>>>>>>                                                
>>>>>> DMA_TO_DEVICE);
>>>>>> @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct
>>>>>> drm_gem_dma_object
>>>>>> *dma_obj, struct vm_area_struct *
>>>>>>            vma->vm_pgoff -= drm_vma_node_start(&obj-
>>>>>>> vma_node);
>>>>>>            vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
>>>>>>     
>>>>>> -       if (dma_obj->map_noncoherent) {
>>>>>> +       if (dma_obj->cached) {
>>>>>>                    vma->vm_page_prot = vm_get_page_prot(vma-
>>>>>>> vm_flags);
>>>>>>     
>>>>>>                    ret = dma_mmap_pages(dma_obj->base.dev-
>>>>>>> dev,
>>>>>> diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> index 5ec75e9ba499..a3df2f99a757 100644
>>>>>> --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
>>>>>> @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct
>>>>>> drm_device
>>>>>> *drm, size_t size)
>>>>>>            if (!obj)
>>>>>>                    return ERR_PTR(-ENOMEM);
>>>>>>     
>>>>>> -       obj->map_noncoherent = priv->soc_info-
>>>>>>> map_noncoherent;
>>>>>> +       if (priv->soc_info->map_noncoherent) {
>>>>>> +               obj->cached = true;
>>>>>> +               obj->coherent = false;
>>>>>> +       }
>>>>>>     
>>>>>>            return &obj->base;
>>>>>>     }
>>>>>> diff --git a/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> b/drivers/gpu/drm/rcar-
>>>>>> du/Kconfig
>>>>>> index 53c356aed5d5..dddc70c08bdc 100644
>>>>>> --- a/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> +++ b/drivers/gpu/drm/rcar-du/Kconfig
>>>>>> @@ -2,8 +2,6 @@
>>>>>>     config DRM_RCAR_DU
>>>>>>            tristate "DRM Support for R-Car Display Unit"
>>>>>>            depends on DRM && OF
>>>>>> -       depends on ARM || ARM64
>>>>>> -       depends on ARCH_RENESAS || COMPILE_TEST
>>>>>>            select DRM_KMS_HELPER
>>>>>>            select DRM_GEM_DMA_HELPER
>>>>>>            select VIDEOMODE_HELPERS
>>>>>> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> index adfb36b0e815..1142d51473e6 100644
>>>>>> --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
>>>>>> @@ -386,7 +386,9 @@ struct drm_gem_object
>>>>>> *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
>>>>>>            gem_obj->funcs = &rcar_du_gem_funcs;
>>>>>>     
>>>>>>            drm_gem_private_object_init(dev, gem_obj, attach-
>>>>>>> dmabuf-
>>>>>>> size);
>>>>>> -       dma_obj->map_noncoherent = false;
>>>>>> +
>>>>>> +       dma_obj->cached = false;
>>>>>> +       dma_obj->coherent = false;
>>>>>>     
>>>>>>            ret = drm_gem_create_mmap_offset(gem_obj);
>>>>>>            if (ret) {
>>>>>> diff --git a/include/drm/drm_gem_dma_helper.h
>>>>>> b/include/drm/drm_gem_dma_helper.h
>>>>>> index 8a043235dad8..585ce3d4d1eb 100644
>>>>>> --- a/include/drm/drm_gem_dma_helper.h
>>>>>> +++ b/include/drm/drm_gem_dma_helper.h
>>>>>> @@ -16,7 +16,9 @@ struct drm_mode_create_dumb;
>>>>>>      *       more than one entry but they are guaranteed to
>>>>>> have
>>>>>> contiguous
>>>>>>      *       DMA addresses.
>>>>>>      * @vaddr: kernel virtual address of the backing memory
>>>>>> - * @map_noncoherent: if true, the GEM object is backed by
>>>>>> non-
>>>>>> coherent memory
>>>>>> + * @cached: if true, the GEM object is backed by cached
>>>>>> memory
>>>>>> + * @coherent: This option only meaningful when a GEM object
>>>>>> is
>>>>>> cached.
>>>>>> + *            If true, Sync the GEM object for DMA access is
>>>>>> not
>>>>>> required.
>>>>>>      */
>>>>>>     struct drm_gem_dma_object {
>>>>>>            struct drm_gem_object base;
>>>>>> @@ -26,7 +28,8 @@ struct drm_gem_dma_object {
>>>>>>            /* For objects with DMA memory allocated by GEM DMA
>>>>>> */
>>>>>>            void *vaddr;
>>>>>>     
>>>>>> -       bool map_noncoherent;
>>>>>> +       bool cached;
>>>>>> +       bool coherent;
>>>>>>     };
>>>>>>     
>>>>>>     #define to_drm_gem_dma_obj(gem_obj) \
  
Sui Jingfeng June 22, 2023, 8:38 p.m. UTC | #13
Hi,

On 2023/6/8 15:39, Maxime Ripard wrote:
> On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
>> Hi,
>>
>> On 2023/6/8 00:12, Paul Cercueil wrote:
>>> Hi Sui,
>>>
>>> Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
>>>> Hi,  welcome to discussion.
>>>>
>>>>
>>>> I have limited skills in manipulating English.
>>>>
>>>> It may not express what I'm really means in the short time.
>>>>
>>>> Part of word in the sentence may not as accurate as your.
>>>>
>>>> Well, please don't misunderstand, I'm not doing the rude to you.
>>> No problem.
>>>
>>>> I will explain it with more details.
>>>>
>>>> See below:
>>>>
>>>>
>>>> On 2023/6/7 20:09, Paul Cercueil wrote:
>>>>> Hi Sui,
>>>>>
>>>>> Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 2023/6/7 17:36, Paul Cercueil wrote:
>>>>>>> Hi Sui,
>>>>>>>
>>>>>>> Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
>>>>>>>> The single map_noncoherent member of struct
>>>>>>>> drm_gem_dma_object
>>>>>>>> may
>>>>>>>> not
>>>>>>>> sufficient for describing the backing memory of the GEM
>>>>>>>> buffer
>>>>>>>> object.
>>>>>>>>
>>>>>>>> Especially on dma-coherent systems, the backing memory is
>>>>>>>> both
>>>>>>>> cached
>>>>>>>> coherent for multi-core CPUs and dma-coherent for peripheral
>>>>>>>> device.
>>>>>>>> Say architectures like X86-64, LoongArch64, Loongson Mips64,
>>>>>>>> etc.
>>>>>>>>
>>>>>>>> Whether a peripheral device is dma-coherent or not can be
>>>>>>>> implementation-dependent. The single map_noncoherent option
>>>>>>>> is
>>>>>>>> not
>>>>>>>> enough
>>>>>>>> to reflect real hardware anymore. For example, the Loongson
>>>>>>>> LS3A4000
>>>>>>>> CPU
>>>>>>>> and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
>>>>>>>> platform
>>>>>>>> allways snoop CPU's cache. Doing the allocation with
>>>>>>>> dma_alloc_coherent
>>>>>>>> function is preferred. The return buffer is cached, it should
>>>>>>>> not
>>>>>>>> using
>>>>>>>> the default write-combine mapping. While with the current
>>>>>>>> implement,
>>>>>>>> there
>>>>>>>> no way to tell the drm core to reflect this.
>>>>>>>>
>>>>>>>> This patch adds cached and coherent members to struct
>>>>>>>> drm_gem_dma_object.
>>>>>>>> which allow driver implements to inform the core. Introducing
>>>>>>>> new
>>>>>>>> mappings
>>>>>>>> while keeping the original default behavior unchanged.
>>>>>>> Did you try to simply set the "dma-coherent" property to the
>>>>>>> device's
>>>>>>> node?
>>>>>> But this approach can only be applied for the device driver with
>>>>>> DT
>>>>>> support.
>>>>>>
>>>>>> X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
>>>>>> do
>>>>>> not
>>>>>> have DT support.
>>>>>>
>>>>>> They using ACPI to pass parameter from the firmware to Linux
>>>>>> kernel.
>>>>>>
>>>>>> You approach will lost the effectiveness on such a case.
>>>>> Well, I don't really know how ACPI handles it - but it should just
>>>>> be a
>>>>> matter of setting dev->dma_coherent. That's basically what the DT
>>>>> code
>>>>> does.
>>>>>
>>>>> Some MIPS boards set it in their setup code for instance.
>>>>>
>>>> This is a *strategy*, not a *mechanism*.
>>>>
>>>> In this case, DT is just used to describing the hardware.
>>>>
>>>> (It is actually a hardware feature describing language, the
>>>> granularity
>>>> is large)
>>>>
>>>> It does not changing the state of the hardware.
>>>>
>>>> It's your platform firmware or kernel setting up code who actually do
>>>> such a things.
>>>>
>>>>
>>>> It's just that it works on *one* platform, it does not guarantee it
>>>> will
>>>> works on others.
>>> If you add the "dma-coherent" property in a device node in DT, you
>>> effectively specify that the device is DMA-coherent; so you describe
>>> the hardware, which is what DT is for, and you are not changing the
>>> state of the hardware.
>>>
>>> Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
>>> default to DMA-coherent mapping; I believe you could do something
>>> similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
>>>
>> The preblem is that device driver can have various demand.
>>
>> It probably want to create different kind of buffers for different thing
>> simultaneously.
>>
>> Say, one allocated with dma_alloc_coherent for command buffer or dma
>> descriptor
>>
>> another one allocated with  dma_alloc_wc for uploading shader etc.
>>
>> also has the third one allocated with dma_alloc_noncoherent() for doing some
>> else.
> And it will work just fine.
>
> struct device dma_coherent, or DT's dma-coherent property define that
> the device doesn't need any kind of cache maintenance, ever. If it's
> missing, we need to perform cache maintenance to keep coherency.
>
> dma_alloc_* functions provide guarantees to the driver. With
> dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
> you don't need to perform cache maintenance operations by hand in the
> driver.
>
> With dma_alloc_noncoherent, the buffer is non-coherent and the driver
> needs to perform them when relevant.
>
> How those buffers are created is platform specific, but the guarantees
> provided *to the driver* are always there.
>
> A buffer allocated with dma_alloc_coherent might be provided by
> different means (at the hardware level with a coherency unit, by mapping
> it non-cacheable), but as far as the driver is concerned it's always
> going to be coherent.
>
> Similarly, a driver using dma_alloc_noncoherent will always require
> cache maintenance operations to use the API properly, even if the
> hardware provides coherency (in which case, those operations will be
> nop).
>
> So, yeah, like I was saying in the other mail, it looks like you're
> confusing a bunch of things. dma_alloc_* functions are about the driver
> expectations and guarantees. DT's dma-coherent property is about how we
> can implement them on a given platform.
>
> They don't have to match, and that's precisely how we can have drivers
> that run on any combination of platforms: the driver only cares about
> the buffer guarantees, the platform description takes care of how they
> are implemented.

You are right in overall.

Yeah, you have better understanding than me.


But let me give you an example which may made people confusing:


The the drm/ingenic and drm/etnaviv (KMS-RO) as an example:



when drm/etnaviv's etnaviv_gem_prime_import_sg_table() function get called,

drm/etnaviv is importing buffer from drm/ingenic.

drm/etnaviv is the importer, and drm/ingenic is the exporter.

drm/ingenic choose non-coherent mapping by default for JZ4770(this is 
gc800 in it).

It's cached, for fast CPU software rendering.


While drm/etnaviv import the buffer, get the SG, using the WC mapping by 
default.

Dose this cause *cache aliasing* because of different driver using 
different cache

mapping for the same backing memory ?


Because the imported buffer originally belong to the KMS driver side.

For drm/ingenic(jz4770), the BO will be cached, but their hardware can't 
guarantee coherency.

when etnaviv finished the rendering, they will do the resolve.

By using the WC mapping, the GPU will write directly to the system RAM.


1)

If CPU flush the cache back to the system RAM(framebuffer when running 
glmark-es2-drm).

then the image(resolved by GPU) in framebuffer will be overwrite by the 
stall data in the cache.


2)

Think about occasions when we need the CPU to do the read access to the 
rendered image.

(snap shoot, or using with X server fake EXA)

The CPU still think the share buffer as cached, it will read from the 
cache if hit.

while GPU write to RAM directly by using WC mapping.


Even it call dma_sync_single_for_device(), it only get SYNC-ed for the 
device.

there is no SYNC for the CPU's cached.

I think, In the end, it will lost of coherency.


3)

If the user want to use X server graphic environment,

then the case will be more complex for 3D acceleration support.


Even it hacks somewhere to call the sync for CPU,

they still may need invalid the cache frequently.

In this case, it will not get a good performance.


At any case,  such a KMS-RO combination((cached no-coherent + WC)) will 
be a misery./
/

While drm/ingenic could give up the hardware acceleration and the 3D 
acceleration in X environment.

it's OK, as its for low-ended graphic application.


But, at the other hand, it is say also why arm soc adhere to the 
write-combine.

because they have no choice.

While ingenic is the first exception, thanks Paul's patch which help us 
to understand a lot thing .


4)

While Loongson LS2K1000 SoC is DMA-coherent,

We are also prefer cached framebuffer for fast CPU software rendering.

I also get the hardware accelerated 3D works successfully,

even only for the GL client (such as glxgears and glmark2).


Therefore, on our hardware platform.

I force both the KMS driver and RO drivers to use cached mapping.

with the hardware maintained cache coherence blessing us.

It turn out that is works.


We don't need ( and don't want ) to call the dma_sync_*() series function,

not because we don't know it is no-op for DMA-coherent.

It is because that we realized that it is not needed with 100% confident.


We want to keep the code clean.

We are ZERO tolerate to the not needed function calls.


> Maxime
  
Maxime Ripard June 26, 2023, 1:19 p.m. UTC | #14
On Fri, Jun 23, 2023 at 04:38:34AM +0800, Sui Jingfeng wrote:
> On 2023/6/8 15:39, Maxime Ripard wrote:
> > On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
> > > Hi,
> > > 
> > > On 2023/6/8 00:12, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> > > > > Hi,  welcome to discussion.
> > > > > 
> > > > > 
> > > > > I have limited skills in manipulating English.
> > > > > 
> > > > > It may not express what I'm really means in the short time.
> > > > > 
> > > > > Part of word in the sentence may not as accurate as your.
> > > > > 
> > > > > Well, please don't misunderstand, I'm not doing the rude to you.
> > > > No problem.
> > > > 
> > > > > I will explain it with more details.
> > > > > 
> > > > > See below:
> > > > > 
> > > > > 
> > > > > On 2023/6/7 20:09, Paul Cercueil wrote:
> > > > > > Hi Sui,
> > > > > > 
> > > > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > 
> > > > > > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > > > > > Hi Sui,
> > > > > > > > 
> > > > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > > > > > The single map_noncoherent member of struct
> > > > > > > > > drm_gem_dma_object
> > > > > > > > > may
> > > > > > > > > not
> > > > > > > > > sufficient for describing the backing memory of the GEM
> > > > > > > > > buffer
> > > > > > > > > object.
> > > > > > > > > 
> > > > > > > > > Especially on dma-coherent systems, the backing memory is
> > > > > > > > > both
> > > > > > > > > cached
> > > > > > > > > coherent for multi-core CPUs and dma-coherent for peripheral
> > > > > > > > > device.
> > > > > > > > > Say architectures like X86-64, LoongArch64, Loongson Mips64,
> > > > > > > > > etc.
> > > > > > > > > 
> > > > > > > > > Whether a peripheral device is dma-coherent or not can be
> > > > > > > > > implementation-dependent. The single map_noncoherent option
> > > > > > > > > is
> > > > > > > > > not
> > > > > > > > > enough
> > > > > > > > > to reflect real hardware anymore. For example, the Loongson
> > > > > > > > > LS3A4000
> > > > > > > > > CPU
> > > > > > > > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware
> > > > > > > > > platform
> > > > > > > > > allways snoop CPU's cache. Doing the allocation with
> > > > > > > > > dma_alloc_coherent
> > > > > > > > > function is preferred. The return buffer is cached, it should
> > > > > > > > > not
> > > > > > > > > using
> > > > > > > > > the default write-combine mapping. While with the current
> > > > > > > > > implement,
> > > > > > > > > there
> > > > > > > > > no way to tell the drm core to reflect this.
> > > > > > > > > 
> > > > > > > > > This patch adds cached and coherent members to struct
> > > > > > > > > drm_gem_dma_object.
> > > > > > > > > which allow driver implements to inform the core. Introducing
> > > > > > > > > new
> > > > > > > > > mappings
> > > > > > > > > while keeping the original default behavior unchanged.
> > > > > > > > Did you try to simply set the "dma-coherent" property to the
> > > > > > > > device's
> > > > > > > > node?
> > > > > > > But this approach can only be applied for the device driver with
> > > > > > > DT
> > > > > > > support.
> > > > > > > 
> > > > > > > X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically
> > > > > > > do
> > > > > > > not
> > > > > > > have DT support.
> > > > > > > 
> > > > > > > They using ACPI to pass parameter from the firmware to Linux
> > > > > > > kernel.
> > > > > > > 
> > > > > > > You approach will lost the effectiveness on such a case.
> > > > > > Well, I don't really know how ACPI handles it - but it should just
> > > > > > be a
> > > > > > matter of setting dev->dma_coherent. That's basically what the DT
> > > > > > code
> > > > > > does.
> > > > > > 
> > > > > > Some MIPS boards set it in their setup code for instance.
> > > > > > 
> > > > > This is a *strategy*, not a *mechanism*.
> > > > > 
> > > > > In this case, DT is just used to describing the hardware.
> > > > > 
> > > > > (It is actually a hardware feature describing language, the
> > > > > granularity
> > > > > is large)
> > > > > 
> > > > > It does not changing the state of the hardware.
> > > > > 
> > > > > It's your platform firmware or kernel setting up code who actually do
> > > > > such a things.
> > > > > 
> > > > > 
> > > > > It's just that it works on *one* platform, it does not guarantee it
> > > > > will
> > > > > works on others.
> > > > If you add the "dma-coherent" property in a device node in DT, you
> > > > effectively specify that the device is DMA-coherent; so you describe
> > > > the hardware, which is what DT is for, and you are not changing the
> > > > state of the hardware.
> > > > 
> > > > Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> > > > default to DMA-coherent mapping; I believe you could do something
> > > > similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
> > > > 
> > > The preblem is that device driver can have various demand.
> > > 
> > > It probably want to create different kind of buffers for different thing
> > > simultaneously.
> > > 
> > > Say, one allocated with dma_alloc_coherent for command buffer or dma
> > > descriptor
> > > 
> > > another one allocated with  dma_alloc_wc for uploading shader etc.
> > > 
> > > also has the third one allocated with dma_alloc_noncoherent() for doing some
> > > else.
> > And it will work just fine.
> > 
> > struct device dma_coherent, or DT's dma-coherent property define that
> > the device doesn't need any kind of cache maintenance, ever. If it's
> > missing, we need to perform cache maintenance to keep coherency.
> > 
> > dma_alloc_* functions provide guarantees to the driver. With
> > dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
> > you don't need to perform cache maintenance operations by hand in the
> > driver.
> > 
> > With dma_alloc_noncoherent, the buffer is non-coherent and the driver
> > needs to perform them when relevant.
> > 
> > How those buffers are created is platform specific, but the guarantees
> > provided *to the driver* are always there.
> > 
> > A buffer allocated with dma_alloc_coherent might be provided by
> > different means (at the hardware level with a coherency unit, by mapping
> > it non-cacheable), but as far as the driver is concerned it's always
> > going to be coherent.
> > 
> > Similarly, a driver using dma_alloc_noncoherent will always require
> > cache maintenance operations to use the API properly, even if the
> > hardware provides coherency (in which case, those operations will be
> > nop).
> > 
> > So, yeah, like I was saying in the other mail, it looks like you're
> > confusing a bunch of things. dma_alloc_* functions are about the driver
> > expectations and guarantees. DT's dma-coherent property is about how we
> > can implement them on a given platform.
> > 
> > They don't have to match, and that's precisely how we can have drivers
> > that run on any combination of platforms: the driver only cares about
> > the buffer guarantees, the platform description takes care of how they
> > are implemented.
> 
> You are right in overall.
> 
> Yeah, you have better understanding than me.
> 
> 
> But let me give you an example which may made people confusing:
> 
> 
> The the drm/ingenic and drm/etnaviv (KMS-RO) as an example:
> 
> 
> 
> when drm/etnaviv's etnaviv_gem_prime_import_sg_table() function get called,
> 
> drm/etnaviv is importing buffer from drm/ingenic.
> 
> drm/etnaviv is the importer, and drm/ingenic is the exporter.
> 
> drm/ingenic choose non-coherent mapping by default for JZ4770(this is gc800
> in it).
> 
> It's cached, for fast CPU software rendering.
> 
> 
> While drm/etnaviv import the buffer, get the SG, using the WC mapping by
> default.
> 
> Dose this cause *cache aliasing* because of different driver using different
> cache
> 
> mapping for the same backing memory ?

This is a slightly different problem though. The main issue here is
that there's multiple mapping with different attributes. Why is
etnaviv even mapping the KMS buffer in the first place?

I'd say it's largely a dma-buf problem if that actually happens.

> Because the imported buffer originally belong to the KMS driver side.
> 
> For drm/ingenic(jz4770), the BO will be cached, but their hardware can't
> guarantee coherency.

You don't need to have hardware coherency to have a coherent buffer. A
buffer mapped non-cacheable is coherent.

> when etnaviv finished the rendering, they will do the resolve.
> 
> By using the WC mapping, the GPU will write directly to the system RAM.

I'm confused. The WC mapping is for the *CPU* mapping. The GPU doesn't
use the CPU mapping.

> 1)
> 
> If CPU flush the cache back to the system RAM(framebuffer when running
> glmark-es2-drm).
> 
> then the image(resolved by GPU) in framebuffer will be overwrite by the
> stall data in the cache.
> 
> 
> 2)
> 
> Think about occasions when we need the CPU to do the read access to the
> rendered image.
> 
> (snap shoot, or using with X server fake EXA)
> 
> The CPU still think the share buffer as cached, it will read from the cache
> if hit.
> 
> while GPU write to RAM directly by using WC mapping.
> 
> 
> Even it call dma_sync_single_for_device(), it only get SYNC-ed for the
> device.
> 
> there is no SYNC for the CPU's cached.
> 
> I think, In the end, it will lost of coherency.
> 
> 
> 3)
> 
> If the user want to use X server graphic environment,
> 
> then the case will be more complex for 3D acceleration support.
> 
> 
> Even it hacks somewhere to call the sync for CPU,
> 
> they still may need invalid the cache frequently.
> 
> In this case, it will not get a good performance.
> 
> 
> At any case,  such a KMS-RO combination((cached no-coherent + WC)) will be a
> misery./
> /
> 
> While drm/ingenic could give up the hardware acceleration and the 3D
> acceleration in X environment.
> 
> it's OK, as its for low-ended graphic application.
> 
> 
> But, at the other hand, it is say also why arm soc adhere to the
> write-combine.
> 
> because they have no choice.
> 
> While ingenic is the first exception, thanks Paul's patch which help us to
> understand a lot thing .
> 
> 
> 4)
> 
> While Loongson LS2K1000 SoC is DMA-coherent,
> 
> We are also prefer cached framebuffer for fast CPU software rendering.
> 
> I also get the hardware accelerated 3D works successfully,
> 
> even only for the GL client (such as glxgears and glmark2).
> 
> 
> Therefore, on our hardware platform.
> 
> I force both the KMS driver and RO drivers to use cached mapping.
> 
> with the hardware maintained cache coherence blessing us.
> 
> It turn out that is works.
> 
> 
> We don't need ( and don't want ) to call the dma_sync_*() series function,
> not because we don't know it is no-op for DMA-coherent.

Then don't do that? Ingenic is the only driver that does. That means
that all the other drivers don't, so follow their lead instead of
ingenic if the trade-off doesn't work for you.

Maxime
  

Patch

diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c b/drivers/gpu/drm/drm_fb_dma_helper.c
index 3b535ad1b07c..93ff05041192 100644
--- a/drivers/gpu/drm/drm_fb_dma_helper.c
+++ b/drivers/gpu/drm/drm_fb_dma_helper.c
@@ -106,16 +106,15 @@  dma_addr_t drm_fb_dma_get_gem_addr(struct drm_framebuffer *fb,
 EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr);
 
 /**
- * drm_fb_dma_sync_non_coherent - Sync GEM object to non-coherent backing
- *	memory
+ * drm_fb_dma_sync_non_coherent - Sync GEM object to cached backing memory
  * @drm: DRM device
  * @old_state: Old plane state
  * @state: New plane state
  *
  * This function can be used by drivers that use damage clips and have
- * DMA GEM objects backed by non-coherent memory. Calling this function
- * in a plane's .atomic_update ensures that all the data in the backing
- * memory have been written to RAM.
+ * DMA GEM objects backed by cached memory. Calling this function in a
+ * plane's .atomic_update ensures that all the data in the backing memory
+ * have been written to RAM.
  */
 void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
 				  struct drm_plane_state *old_state,
@@ -131,7 +130,7 @@  void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
 
 	for (i = 0; i < finfo->num_planes; i++) {
 		dma_obj = drm_fb_dma_get_gem_obj(state->fb, i);
-		if (!dma_obj->map_noncoherent)
+		if (dma_obj->cached && dma_obj->coherent)
 			continue;
 
 		daddr = drm_fb_dma_get_gem_addr(state->fb, state, i);
diff --git a/drivers/gpu/drm/drm_fbdev_dma.c b/drivers/gpu/drm/drm_fbdev_dma.c
index d86773fa8ab0..49fe9b284cc8 100644
--- a/drivers/gpu/drm/drm_fbdev_dma.c
+++ b/drivers/gpu/drm/drm_fbdev_dma.c
@@ -131,7 +131,7 @@  static int drm_fbdev_dma_helper_fb_probe(struct drm_fb_helper *fb_helper,
 
 	/* screen */
 	info->flags |= FBINFO_VIRTFB; /* system memory */
-	if (dma_obj->map_noncoherent)
+	if (dma_obj->cached)
 		info->flags |= FBINFO_READS_FAST; /* signal caching */
 	info->screen_size = sizes->surface_height * fb->pitches[0];
 	info->screen_buffer = map.vaddr;
diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c
index 870b90b78bc4..dec1d512bdf1 100644
--- a/drivers/gpu/drm/drm_gem_dma_helper.c
+++ b/drivers/gpu/drm/drm_gem_dma_helper.c
@@ -93,7 +93,11 @@  __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private)
 		drm_gem_private_object_init(drm, gem_obj, size);
 
 		/* Always use writecombine for dma-buf mappings */
-		dma_obj->map_noncoherent = false;
+		/* FIXME: This is not always true, on some dma coherent system,
+		 * cached mappings should be preferred over writecombine
+		 */
+		dma_obj->cached = false;
+		dma_obj->coherent = false;
 	} else {
 		ret = drm_gem_object_init(drm, gem_obj, size);
 	}
@@ -143,7 +147,11 @@  struct drm_gem_dma_object *drm_gem_dma_create(struct drm_device *drm,
 	if (IS_ERR(dma_obj))
 		return dma_obj;
 
-	if (dma_obj->map_noncoherent) {
+	if (dma_obj->cached && dma_obj->coherent) {
+		dma_obj->vaddr = dma_alloc_coherent(drm->dev, size,
+						    &dma_obj->dma_addr,
+						    GFP_KERNEL | __GFP_NOWARN);
+	} else if (dma_obj->cached && !dma_obj->coherent) {
 		dma_obj->vaddr = dma_alloc_noncoherent(drm->dev, size,
 						       &dma_obj->dma_addr,
 						       DMA_TO_DEVICE,
@@ -153,6 +161,7 @@  struct drm_gem_dma_object *drm_gem_dma_create(struct drm_device *drm,
 					      &dma_obj->dma_addr,
 					      GFP_KERNEL | __GFP_NOWARN);
 	}
+
 	if (!dma_obj->vaddr) {
 		drm_dbg(drm, "failed to allocate buffer with size %zu\n",
 			 size);
@@ -233,7 +242,10 @@  void drm_gem_dma_free(struct drm_gem_dma_object *dma_obj)
 			dma_buf_vunmap_unlocked(gem_obj->import_attach->dmabuf, &map);
 		drm_prime_gem_destroy(gem_obj, dma_obj->sgt);
 	} else if (dma_obj->vaddr) {
-		if (dma_obj->map_noncoherent)
+		if (dma_obj->cached && dma_obj->coherent)
+			dma_free_coherent(gem_obj->dev->dev, dma_obj->base.size,
+					  dma_obj->vaddr, dma_obj->dma_addr);
+		else if (dma_obj->cached && !dma_obj->coherent)
 			dma_free_noncoherent(gem_obj->dev->dev, dma_obj->base.size,
 					     dma_obj->vaddr, dma_obj->dma_addr,
 					     DMA_TO_DEVICE);
@@ -532,7 +544,7 @@  int drm_gem_dma_mmap(struct drm_gem_dma_object *dma_obj, struct vm_area_struct *
 	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);
 	vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP);
 
-	if (dma_obj->map_noncoherent) {
+	if (dma_obj->cached) {
 		vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
 		ret = dma_mmap_pages(dma_obj->base.dev->dev,
diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
index 5ec75e9ba499..a3df2f99a757 100644
--- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
+++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c
@@ -919,7 +919,10 @@  ingenic_drm_gem_create_object(struct drm_device *drm, size_t size)
 	if (!obj)
 		return ERR_PTR(-ENOMEM);
 
-	obj->map_noncoherent = priv->soc_info->map_noncoherent;
+	if (priv->soc_info->map_noncoherent) {
+		obj->cached = true;
+		obj->coherent = false;
+	}
 
 	return &obj->base;
 }
diff --git a/drivers/gpu/drm/rcar-du/Kconfig b/drivers/gpu/drm/rcar-du/Kconfig
index 53c356aed5d5..dddc70c08bdc 100644
--- a/drivers/gpu/drm/rcar-du/Kconfig
+++ b/drivers/gpu/drm/rcar-du/Kconfig
@@ -2,8 +2,6 @@ 
 config DRM_RCAR_DU
 	tristate "DRM Support for R-Car Display Unit"
 	depends on DRM && OF
-	depends on ARM || ARM64
-	depends on ARCH_RENESAS || COMPILE_TEST
 	select DRM_KMS_HELPER
 	select DRM_GEM_DMA_HELPER
 	select VIDEOMODE_HELPERS
diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
index adfb36b0e815..1142d51473e6 100644
--- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c
+++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c
@@ -386,7 +386,9 @@  struct drm_gem_object *rcar_du_gem_prime_import_sg_table(struct drm_device *dev,
 	gem_obj->funcs = &rcar_du_gem_funcs;
 
 	drm_gem_private_object_init(dev, gem_obj, attach->dmabuf->size);
-	dma_obj->map_noncoherent = false;
+
+	dma_obj->cached = false;
+	dma_obj->coherent = false;
 
 	ret = drm_gem_create_mmap_offset(gem_obj);
 	if (ret) {
diff --git a/include/drm/drm_gem_dma_helper.h b/include/drm/drm_gem_dma_helper.h
index 8a043235dad8..585ce3d4d1eb 100644
--- a/include/drm/drm_gem_dma_helper.h
+++ b/include/drm/drm_gem_dma_helper.h
@@ -16,7 +16,9 @@  struct drm_mode_create_dumb;
  *       more than one entry but they are guaranteed to have contiguous
  *       DMA addresses.
  * @vaddr: kernel virtual address of the backing memory
- * @map_noncoherent: if true, the GEM object is backed by non-coherent memory
+ * @cached: if true, the GEM object is backed by cached memory
+ * @coherent: This option only meaningful when a GEM object is cached.
+ *            If true, Sync the GEM object for DMA access is not required.
  */
 struct drm_gem_dma_object {
 	struct drm_gem_object base;
@@ -26,7 +28,8 @@  struct drm_gem_dma_object {
 	/* For objects with DMA memory allocated by GEM DMA */
 	void *vaddr;
 
-	bool map_noncoherent;
+	bool cached;
+	bool coherent;
 };
 
 #define to_drm_gem_dma_obj(gem_obj) \