drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog

Message ID 20231030-sc8280xp-dpu-safe-lut-v1-1-6d485d7b428f@quicinc.com
State New
Headers
Series drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog |

Commit Message

Bjorn Andersson Oct. 30, 2023, 11:23 p.m. UTC
  During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
typically take 1-2ms to complete. As expected this results in poor
performance, something that has been mitigated by proposing running the
iommu in non-strict mode (boot with iommu.strict=0).

This turns out to be related to the SAFE logic, and programming the QOS
SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
TLB sync time to below 10us, which means significant less time spent
with interrupts disabled and a significant boost in throughput.

Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
Cc: stable@vger.kernel.org
Suggested-by: Doug Anderson <dianders@chromium.org>
Suggested-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
---
 drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
 1 file changed, 1 insertion(+)


---
base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452

Best regards,
  

Comments

Manivannan Sadhasivam Oct. 31, 2023, 8:19 a.m. UTC | #1
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
> 
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.
> 
> Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
> Cc: stable@vger.kernel.org
> Suggested-by: Doug Anderson <dianders@chromium.org>
> Suggested-by: Rob Clark <robdclark@chromium.org>
> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> ---
>  drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> index 1ccd1edd693c..4c0528794e7a 100644
> --- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> +++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> @@ -406,6 +406,7 @@ static const struct dpu_perf_cfg sc8280xp_perf_data = {
>  	.min_llcc_ib = 0,
>  	.min_dram_ib = 800000,
>  	.danger_lut_tbl = {0xf, 0xffff, 0x0},
> +	.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},

What does these values represent? And how SAFE is to override the default QoS
values?

I'm not too familiar with the MSM DRM driver, so please excuse my ignorance.

- Mani

>  	.qos_lut_tbl = {
>  		{.nentry = ARRAY_SIZE(sc8180x_qos_linear),
>  		.entries = sc8180x_qos_linear
> 
> ---
> base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
> change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452
> 
> Best regards,
> -- 
> Bjorn Andersson <quic_bjorande@quicinc.com>
>
  
Johan Hovold Oct. 31, 2023, 12:35 p.m. UTC | #2
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
> 
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.

I ran some tests with a gigabit ethernet adapter to get an idea of how
this performs in comparison to using lazy iommu mode ("non-strict"):

		6.6	6.6-lazy	6.6-dpu		6.6-dpu-lazy
iperf3 recv	114	941		941		941		MBit/s
iperf3 send	124	891		703		940		MBit/s

scp recv	14.6	110		110		111		MB/s
scp send	12.5	98.9		91.5		110		MB/s

This patch in itself indeed improves things quite a bit, but there is
still some performance that can be gained by using lazy iommu mode.

Notably, lazy mode with this patch applied appears to saturate the link
in both directions.

Tested-by: Johan Hovold <johan+linaro@kernel.org>

Johan
  
Rob Clark Oct. 31, 2023, 12:46 p.m. UTC | #3
On Tue, Oct 31, 2023 at 1:19 AM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
> > During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> > typically take 1-2ms to complete. As expected this results in poor
> > performance, something that has been mitigated by proposing running the
> > iommu in non-strict mode (boot with iommu.strict=0).
> >
> > This turns out to be related to the SAFE logic, and programming the QOS
> > SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> > TLB sync time to below 10us, which means significant less time spent
> > with interrupts disabled and a significant boost in throughput.
> >
> > Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
> > Cc: stable@vger.kernel.org
> > Suggested-by: Doug Anderson <dianders@chromium.org>
> > Suggested-by: Rob Clark <robdclark@chromium.org>
> > Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> > ---
> >  drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> > index 1ccd1edd693c..4c0528794e7a 100644
> > --- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> > +++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> > @@ -406,6 +406,7 @@ static const struct dpu_perf_cfg sc8280xp_perf_data = {
> >       .min_llcc_ib = 0,
> >       .min_dram_ib = 800000,
> >       .danger_lut_tbl = {0xf, 0xffff, 0x0},
> > +     .safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
>
> What does these values represent? And how SAFE is to override the default QoS
> values?
>
> I'm not too familiar with the MSM DRM driver, so please excuse my ignorance.

for realtime dma (like scanout) there is a sort of "safe" signal from
the dma master to the smmu to indicate when it has enough data
buffered for it to be safe to do tlbinv without risking underflow.
When things aren't "safe" the smmu will stall tlbinv.  This is just
configuring the thresholds for the "safe" signal.

BR,
-R

> - Mani
>
> >       .qos_lut_tbl = {
> >               {.nentry = ARRAY_SIZE(sc8180x_qos_linear),
> >               .entries = sc8180x_qos_linear
> >
> > ---
> > base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
> > change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452
> >
> > Best regards,
> > --
> > Bjorn Andersson <quic_bjorande@quicinc.com>
> >
>
> --
> மணிவண்ணன் சதாசிவம்
  
Rob Clark Oct. 31, 2023, 12:47 p.m. UTC | #4
On Tue, Oct 31, 2023 at 5:35 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote:
> > During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> > typically take 1-2ms to complete. As expected this results in poor
> > performance, something that has been mitigated by proposing running the
> > iommu in non-strict mode (boot with iommu.strict=0).
> >
> > This turns out to be related to the SAFE logic, and programming the QOS
> > SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> > TLB sync time to below 10us, which means significant less time spent
> > with interrupts disabled and a significant boost in throughput.
>
> I ran some tests with a gigabit ethernet adapter to get an idea of how
> this performs in comparison to using lazy iommu mode ("non-strict"):
>
>                 6.6     6.6-lazy        6.6-dpu         6.6-dpu-lazy
> iperf3 recv     114     941             941             941             MBit/s
> iperf3 send     124     891             703             940             MBit/s
>
> scp recv        14.6    110             110             111             MB/s
> scp send        12.5    98.9            91.5            110             MB/s
>
> This patch in itself indeed improves things quite a bit, but there is
> still some performance that can be gained by using lazy iommu mode.
>
> Notably, lazy mode with this patch applied appears to saturate the link
> in both directions.

Maybe there is still room for SoC specific udev rules so dma masters
without firmware can be configured as "lazy", ie. like:

https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/refs/heads/main/baseboard-trogdor/chromeos-base/chromeos-bsp-baseboard-trogdor/files/98-qcom-nonstrict-iommu.rules

BR,
-R

> Tested-by: Johan Hovold <johan+linaro@kernel.org>
>
> Johan
  
Steev Klimaszewski Oct. 31, 2023, 6:12 p.m. UTC | #5
On Mon, Oct 30, 2023 at 6:23 PM Bjorn Andersson
<quic_bjorande@quicinc.com> wrote:
>
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
>
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.
>
> Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
> Cc: stable@vger.kernel.org
> Suggested-by: Doug Anderson <dianders@chromium.org>
> Suggested-by: Rob Clark <robdclark@chromium.org>
> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> ---
>  drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> index 1ccd1edd693c..4c0528794e7a 100644
> --- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> +++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> @@ -406,6 +406,7 @@ static const struct dpu_perf_cfg sc8280xp_perf_data = {
>         .min_llcc_ib = 0,
>         .min_dram_ib = 800000,
>         .danger_lut_tbl = {0xf, 0xffff, 0x0},
> +       .safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
>         .qos_lut_tbl = {
>                 {.nentry = ARRAY_SIZE(sc8180x_qos_linear),
>                 .entries = sc8180x_qos_linear
>
> ---
> base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
> change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452
>
> Best regards,
> --
> Bjorn Andersson <quic_bjorande@quicinc.com>
>

Tested-by: Steev Klimaszewski <steev@kali.org>
  
Abhinav Kumar Nov. 15, 2023, 6:41 p.m. UTC | #6
On 10/30/2023 4:23 PM, Bjorn Andersson wrote:
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
> 
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.
> 
> Fixes: 4a352c2fc15a ("drm/msm/dpu: Introduce SC8280XP")
> Cc: stable@vger.kernel.org
> Suggested-by: Doug Anderson <dianders@chromium.org>
> Suggested-by: Rob Clark <robdclark@chromium.org>
> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> ---

Matches what we have in downstream DT, hence

Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com>

>   drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> index 1ccd1edd693c..4c0528794e7a 100644
> --- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> +++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
> @@ -406,6 +406,7 @@ static const struct dpu_perf_cfg sc8280xp_perf_data = {
>   	.min_llcc_ib = 0,
>   	.min_dram_ib = 800000,
>   	.danger_lut_tbl = {0xf, 0xffff, 0x0},
> +	.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
>   	.qos_lut_tbl = {
>   		{.nentry = ARRAY_SIZE(sc8180x_qos_linear),
>   		.entries = sc8180x_qos_linear
> 
> ---
> base-commit: c503e3eec382ac708ee7adf874add37b77c5d312
> change-id: 20231030-sc8280xp-dpu-safe-lut-9769027b8452
> 
> Best regards,
  
Abhinav Kumar Nov. 21, 2023, 6:40 p.m. UTC | #7
On Mon, 30 Oct 2023 16:23:20 -0700, Bjorn Andersson wrote:
> During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to
> typically take 1-2ms to complete. As expected this results in poor
> performance, something that has been mitigated by proposing running the
> iommu in non-strict mode (boot with iommu.strict=0).
> 
> This turns out to be related to the SAFE logic, and programming the QOS
> SAFE values in the DPU (per suggestion from Rob and Doug) reduces the
> TLB sync time to below 10us, which means significant less time spent
> with interrupts disabled and a significant boost in throughput.
> 
> [...]

Applied, thanks!

[1/1] drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog
      https://gitlab.freedesktop.org/drm/msm/-/commit/a33b2431d11b

Best regards,
  

Patch

diff --git a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
index 1ccd1edd693c..4c0528794e7a 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
+++ b/drivers/gpu/drm/msm/disp/dpu1/catalog/dpu_8_0_sc8280xp.h
@@ -406,6 +406,7 @@  static const struct dpu_perf_cfg sc8280xp_perf_data = {
 	.min_llcc_ib = 0,
 	.min_dram_ib = 800000,
 	.danger_lut_tbl = {0xf, 0xffff, 0x0},
+	.safe_lut_tbl = {0xfe00, 0xfe00, 0xffff},
 	.qos_lut_tbl = {
 		{.nentry = ARRAY_SIZE(sc8180x_qos_linear),
 		.entries = sc8180x_qos_linear