Message ID | 20240208-fd-dpu-debug-timeout-v2-1-9f907f1bdd87@linaro.org |
---|---|
State | New |
Headers |
From: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Date: Thu, 08 Feb 2024 16:50:27 +0200
Subject: [PATCH v2] drm/msm/dpu: make "vblank timeout" more useful
Message-Id: <20240208-fd-dpu-debug-timeout-v2-1-9f907f1bdd87@linaro.org>
To: Rob Clark <robdclark@gmail.com>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Sean Paul <sean@poorly.run>, Marijn Suijten <marijn.suijten@somainline.org>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>
Cc: linux-arm-msm@vger.kernel.org, dri-devel@lists.freedesktop.org, freedreno@lists.freedesktop.org, linux-kernel@vger.kernel.org
Series | [v2] drm/msm/dpu: make "vblank timeout" more useful
Commit Message
Dmitry Baryshkov
Feb. 8, 2024, 2:50 p.m. UTC
We have several reports of vblank timeout messages. However, after some
debugging it was found that there might be different causes for that.
To allow us to identify the DPU block that gets stuck, include the
actual CTL_FLUSH value in the timeout message and trigger the devcore
snapshot capture.
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
---
Changes in v2:
- Added a call to msm_disp_snapshot_state() to trigger devcore dump
(Abhinav)
- Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org
---
drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
---
base-commit: 39676dfe52331dba909c617f213fdb21015c8d10
change-id: 20240106-fd-dpu-debug-timeout-e917f0bc8063
Best regards,
Comments
On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: > We have several reports of vblank timeout messages. However after some > debugging it was found that there might be different causes to that. > To allow us to identify the DPU block that gets stuck, include the > actual CTL_FLUSH value into the timeout message and trigger the devcore > snapshot capture. > > Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > --- > Changes in v2: > - Added a call to msm_disp_snapshot_state() to trigger devcore dump > (Abhinav) > - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org > --- > drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > index d0f56c5c4cce..a8d6165b3c0a 100644 > --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( > (hw_ctl->ops.get_flush_register(hw_ctl) == 0), > msecs_to_jiffies(50)); > if (ret <= 0) { > - DPU_ERROR("vblank timeout\n"); > + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); > + msm_disp_snapshot_state(phys_enc->parent->dev); There is no rate limiting in this piece of code unfortunately. So this will flood the number of snapshots. Short-term solution is you can go with a vblank_timeout_cnt and reset it in the enable() like other similar error counters. long-term solution is we need to centralize these error locations to one single dpu_encoder_error_handler() with a single counter and the error handler will print out the error code along with the snapshot instead of the snapshot being called from all over the place. > return -ETIMEDOUT; > } > > > --- > base-commit: 39676dfe52331dba909c617f213fdb21015c8d10 > change-id: 20240106-fd-dpu-debug-timeout-e917f0bc8063 > > Best regards,
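For reference, the short-term rate limiting suggested above could look roughly like the sketch below. This is only an illustration of the idea: the vblank_timeout_cnt field does not exist in struct dpu_encoder_phys today, and the exact reset point in the enable path would have to be chosen when actually implementing it.

	/* Illustrative sketch of the suggested short-term rate limiting:
	 * capture a devcore snapshot only on the first vblank timeout after
	 * the encoder was (re)enabled.  vblank_timeout_cnt is hypothetical. */
	if (ret <= 0) {
		DPU_ERROR("vblank timeout: %x\n",
			  hw_ctl->ops.get_flush_register(hw_ctl));
		if (phys_enc->vblank_timeout_cnt++ == 0)
			msm_disp_snapshot_state(phys_enc->parent->dev);
		return -ETIMEDOUT;
	}

	/* ... and in the encoder enable path: */
	phys_enc->vblank_timeout_cnt = 0;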
On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > > > > On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: > > We have several reports of vblank timeout messages. However after some > > debugging it was found that there might be different causes to that. > > To allow us to identify the DPU block that gets stuck, include the > > actual CTL_FLUSH value into the timeout message and trigger the devcore > > snapshot capture. > > > > Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > > --- > > Changes in v2: > > - Added a call to msm_disp_snapshot_state() to trigger devcore dump > > (Abhinav) > > - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org > > --- > > drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > > index d0f56c5c4cce..a8d6165b3c0a 100644 > > --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > > +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > > @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( > > (hw_ctl->ops.get_flush_register(hw_ctl) == 0), > > msecs_to_jiffies(50)); > > if (ret <= 0) { > > - DPU_ERROR("vblank timeout\n"); > > + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); > > + msm_disp_snapshot_state(phys_enc->parent->dev); > > > There is no rate limiting in this piece of code unfortunately. So this > will flood the number of snapshots. Well... Yes and no. The devcoredump will destroy other snapshots if there is a pending one. So only the console will be flooded and only in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. > > Short-term solution is you can go with a vblank_timeout_cnt and reset it > in the enable() like other similar error counters. > > long-term solution is we need to centralize these error locations to one > single dpu_encoder_error_handler() with a single counter and the error > handler will print out the error code along with the snapshot instead of > the snapshot being called from all over the place. > > > > > return -ETIMEDOUT; > > } > > > > > > --- > > base-commit: 39676dfe52331dba909c617f213fdb21015c8d10 > > change-id: 20240106-fd-dpu-debug-timeout-e917f0bc8063 > > > > Best regards,
On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: > On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >> >> >> >> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: >>> We have several reports of vblank timeout messages. However after some >>> debugging it was found that there might be different causes to that. >>> To allow us to identify the DPU block that gets stuck, include the >>> actual CTL_FLUSH value into the timeout message and trigger the devcore >>> snapshot capture. >>> >>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> >>> --- >>> Changes in v2: >>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump >>> (Abhinav) >>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org >>> --- >>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>> index d0f56c5c4cce..a8d6165b3c0a 100644 >>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( >>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), >>> msecs_to_jiffies(50)); >>> if (ret <= 0) { >>> - DPU_ERROR("vblank timeout\n"); >>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); >>> + msm_disp_snapshot_state(phys_enc->parent->dev); >> >> >> There is no rate limiting in this piece of code unfortunately. So this >> will flood the number of snapshots. > > Well... Yes and no. The devcoredump will destroy other snapshots if > there is a pending one. So only the console will be flooded and only > in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. > Yes, true but at the same time this makes it hard to capture a good dump as potentially every vblank you could timeout so this destroy/create cycle wont end. >> >> Short-term solution is you can go with a vblank_timeout_cnt and reset it >> in the enable() like other similar error counters. >> >> long-term solution is we need to centralize these error locations to one >> single dpu_encoder_error_handler() with a single counter and the error >> handler will print out the error code along with the snapshot instead of >> the snapshot being called from all over the place. >> >> >> >>> return -ETIMEDOUT; >>> } >>> >>> >>> --- >>> base-commit: 39676dfe52331dba909c617f213fdb21015c8d10 >>> change-id: 20240106-fd-dpu-debug-timeout-e917f0bc8063 >>> >>> Best regards, > > >
On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > > > > On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: > > On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >> > >> > >> > >> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: > >>> We have several reports of vblank timeout messages. However after some > >>> debugging it was found that there might be different causes to that. > >>> To allow us to identify the DPU block that gets stuck, include the > >>> actual CTL_FLUSH value into the timeout message and trigger the devcore > >>> snapshot capture. > >>> > >>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > >>> --- > >>> Changes in v2: > >>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump > >>> (Abhinav) > >>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org > >>> --- > >>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- > >>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>> index d0f56c5c4cce..a8d6165b3c0a 100644 > >>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( > >>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), > >>> msecs_to_jiffies(50)); > >>> if (ret <= 0) { > >>> - DPU_ERROR("vblank timeout\n"); > >>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); > >>> + msm_disp_snapshot_state(phys_enc->parent->dev); > >> > >> > >> There is no rate limiting in this piece of code unfortunately. So this > >> will flood the number of snapshots. > > > > Well... Yes and no. The devcoredump will destroy other snapshots if > > there is a pending one. So only the console will be flooded and only > > in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. > > > > Yes, true but at the same time this makes it hard to capture a good dump > as potentially every vblank you could timeout so this destroy/create > cycle wont end. Excuse me, maybe I miss something. On the first timeout the snapshot is created. It is held by the kernel until it is fully read out from the userspace. Other snapshots will not interfere with this snapshot. Or are you worried that snapshotting takes time, so taking a snapshot will also interfere with the vblank timings for the next vblank?
On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote: > On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >> >> >> >> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: >>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >>>> >>>> >>>> >>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: >>>>> We have several reports of vblank timeout messages. However after some >>>>> debugging it was found that there might be different causes to that. >>>>> To allow us to identify the DPU block that gets stuck, include the >>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore >>>>> snapshot capture. >>>>> >>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> >>>>> --- >>>>> Changes in v2: >>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump >>>>> (Abhinav) >>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org >>>>> --- >>>>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- >>>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>> index d0f56c5c4cce..a8d6165b3c0a 100644 >>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( >>>>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), >>>>> msecs_to_jiffies(50)); >>>>> if (ret <= 0) { >>>>> - DPU_ERROR("vblank timeout\n"); >>>>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); >>>>> + msm_disp_snapshot_state(phys_enc->parent->dev); >>>> >>>> >>>> There is no rate limiting in this piece of code unfortunately. So this >>>> will flood the number of snapshots. >>> >>> Well... Yes and no. The devcoredump will destroy other snapshots if >>> there is a pending one. So only the console will be flooded and only >>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. >>> >> >> Yes, true but at the same time this makes it hard to capture a good dump >> as potentially every vblank you could timeout so this destroy/create >> cycle wont end. > > Excuse me, maybe I miss something. On the first timeout the snapshot > is created. It is held by the kernel until it is fully read out from > the userspace. Other snapshots will not interfere with this snapshot. > For every new snapshot a new devcoredump device will be created which should remain till it has been read. But now this will be created every blank. IMO, this is really too much data for no reason. Subsequent vblank timeouts are not going to give any new information compared to the existing snapshot of the first vblank timeout thats why we should just create the snapshot when the first error happens and stop. For other frame done timeouts, infact subsequent timeouts without any sort of recovery in between are quite misleading because hardware was already not able to fetch the previous frame so it will most likely not fetch the next one either till it has recovered. Typically thats why these vblank timeouts happen in a flurry as the hardware never really recovered from the first timeout. > Or are you worried that snapshotting takes time, so taking a snapshot > will also interfere with the vblank timings for the next vblank? > Yes this is another point.
On Wed, 21 Feb 2024 at 00:40, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > > > > On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote: > > On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >> > >> > >> > >> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: > >>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >>>> > >>>> > >>>> > >>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: > >>>>> We have several reports of vblank timeout messages. However after some > >>>>> debugging it was found that there might be different causes to that. > >>>>> To allow us to identify the DPU block that gets stuck, include the > >>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore > >>>>> snapshot capture. > >>>>> > >>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > >>>>> --- > >>>>> Changes in v2: > >>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump > >>>>> (Abhinav) > >>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org > >>>>> --- > >>>>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- > >>>>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>> index d0f56c5c4cce..a8d6165b3c0a 100644 > >>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( > >>>>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), > >>>>> msecs_to_jiffies(50)); > >>>>> if (ret <= 0) { > >>>>> - DPU_ERROR("vblank timeout\n"); > >>>>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); > >>>>> + msm_disp_snapshot_state(phys_enc->parent->dev); > >>>> > >>>> > >>>> There is no rate limiting in this piece of code unfortunately. So this > >>>> will flood the number of snapshots. > >>> > >>> Well... Yes and no. The devcoredump will destroy other snapshots if > >>> there is a pending one. So only the console will be flooded and only > >>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. > >>> > >> > >> Yes, true but at the same time this makes it hard to capture a good dump > >> as potentially every vblank you could timeout so this destroy/create > >> cycle wont end. > > > > Excuse me, maybe I miss something. On the first timeout the snapshot > > is created. It is held by the kernel until it is fully read out from > > the userspace. Other snapshots will not interfere with this snapshot. > > > > For every new snapshot a new devcoredump device will be created which > should remain till it has been read. But now this will be created every > blank. IMO, this is really too much data for no reason. No-no-no. If there is a devcoredump for a device, the next one will not be created. See dev_coredumpm(). So all the snapshots will be created and then destroyed immediately. > > Subsequent vblank timeouts are not going to give any new information > compared to the existing snapshot of the first vblank timeout thats why > we should just create the snapshot when the first error happens and stop. 
> > For other frame done timeouts, infact subsequent timeouts without any > sort of recovery in between are quite misleading because hardware was > already not able to fetch the previous frame so it will most likely not > fetch the next one either till it has recovered. Typically thats why > these vblank timeouts happen in a flurry as the hardware never really > recovered from the first timeout. > > > Or are you worried that snapshotting takes time, so taking a snapshot > > will also interfere with the vblank timings for the next vblank? > > > > Yes this is another point.
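To make the dev_coredumpm() point above easier to follow: the de-duplication Dmitry refers to is roughly the following (a simplified sketch of the behaviour, not the literal code in drivers/base/devcoredump.c):

	/* Simplified: if the failing device already has a pending devcoredump,
	 * dev_coredumpm() frees the new dump data immediately instead of
	 * creating another devcdN device. */
	existing = class_find_device(&devcd_class, NULL, failing_dev,
				     devcd_match_failing);
	if (existing) {
		put_device(existing);
		free(data);	/* the new snapshot is discarded right away */
		return;
	}
	/* otherwise register /sys/class/devcoredump/devcdN and expose the dump */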
On 2/20/2024 2:42 PM, Dmitry Baryshkov wrote: > On Wed, 21 Feb 2024 at 00:40, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >> >> >> >> On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote: >>> On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >>>> >>>> >>>> >>>> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: >>>>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: >>>>>>> We have several reports of vblank timeout messages. However after some >>>>>>> debugging it was found that there might be different causes to that. >>>>>>> To allow us to identify the DPU block that gets stuck, include the >>>>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore >>>>>>> snapshot capture. >>>>>>> >>>>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> >>>>>>> --- >>>>>>> Changes in v2: >>>>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump >>>>>>> (Abhinav) >>>>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org >>>>>>> --- >>>>>>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- >>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>>>> index d0f56c5c4cce..a8d6165b3c0a 100644 >>>>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c >>>>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( >>>>>>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), >>>>>>> msecs_to_jiffies(50)); >>>>>>> if (ret <= 0) { >>>>>>> - DPU_ERROR("vblank timeout\n"); >>>>>>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); >>>>>>> + msm_disp_snapshot_state(phys_enc->parent->dev); >>>>>> >>>>>> >>>>>> There is no rate limiting in this piece of code unfortunately. So this >>>>>> will flood the number of snapshots. >>>>> >>>>> Well... Yes and no. The devcoredump will destroy other snapshots if >>>>> there is a pending one. So only the console will be flooded and only >>>>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. >>>>> >>>> >>>> Yes, true but at the same time this makes it hard to capture a good dump >>>> as potentially every vblank you could timeout so this destroy/create >>>> cycle wont end. >>> >>> Excuse me, maybe I miss something. On the first timeout the snapshot >>> is created. It is held by the kernel until it is fully read out from >>> the userspace. Other snapshots will not interfere with this snapshot. >>> >> >> For every new snapshot a new devcoredump device will be created which >> should remain till it has been read. But now this will be created every >> blank. IMO, this is really too much data for no reason. > > No-no-no. If there is a devcoredump for a device, the next one will > not be created. See dev_coredumpm(). > So all the snapshots will be created and then destroyed immediately. > hmm ... I have certainly seen devcd_count go higher than one (but not more than 2). I am wondering whether this was because of some race condition of the previous destroy / new create. But anyway, this part is clear now. thanks. 
>> >> Subsequent vblank timeouts are not going to give any new information >> compared to the existing snapshot of the first vblank timeout thats why >> we should just create the snapshot when the first error happens and stop. >> >> For other frame done timeouts, infact subsequent timeouts without any >> sort of recovery in between are quite misleading because hardware was >> already not able to fetch the previous frame so it will most likely not >> fetch the next one either till it has recovered. Typically thats why >> these vblank timeouts happen in a flurry as the hardware never really >> recovered from the first timeout. >> >>> Or are you worried that snapshotting takes time, so taking a snapshot >>> will also interfere with the vblank timings for the next vblank? >>> >> >> Yes this is another point. > snapshots will still be captured every vblank timeout and reading through the entire DPU reg space every vblank timeout is certainly something we can avoid.
On Wed, 21 Feb 2024 at 01:04, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > > > > On 2/20/2024 2:42 PM, Dmitry Baryshkov wrote: > > On Wed, 21 Feb 2024 at 00:40, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >> > >> > >> > >> On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote: > >>> On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >>>> > >>>> > >>>> > >>>> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote: > >>>>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote: > >>>>>>> We have several reports of vblank timeout messages. However after some > >>>>>>> debugging it was found that there might be different causes to that. > >>>>>>> To allow us to identify the DPU block that gets stuck, include the > >>>>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore > >>>>>>> snapshot capture. > >>>>>>> > >>>>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > >>>>>>> --- > >>>>>>> Changes in v2: > >>>>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump > >>>>>>> (Abhinav) > >>>>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org > >>>>>>> --- > >>>>>>> drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++- > >>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>>>>>> > >>>>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>>>> index d0f56c5c4cce..a8d6165b3c0a 100644 > >>>>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c > >>>>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done( > >>>>>>> (hw_ctl->ops.get_flush_register(hw_ctl) == 0), > >>>>>>> msecs_to_jiffies(50)); > >>>>>>> if (ret <= 0) { > >>>>>>> - DPU_ERROR("vblank timeout\n"); > >>>>>>> + DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl)); > >>>>>>> + msm_disp_snapshot_state(phys_enc->parent->dev); > >>>>>> > >>>>>> > >>>>>> There is no rate limiting in this piece of code unfortunately. So this > >>>>>> will flood the number of snapshots. > >>>>> > >>>>> Well... Yes and no. The devcoredump will destroy other snapshots if > >>>>> there is a pending one. So only the console will be flooded and only > >>>>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled. > >>>>> > >>>> > >>>> Yes, true but at the same time this makes it hard to capture a good dump > >>>> as potentially every vblank you could timeout so this destroy/create > >>>> cycle wont end. > >>> > >>> Excuse me, maybe I miss something. On the first timeout the snapshot > >>> is created. It is held by the kernel until it is fully read out from > >>> the userspace. Other snapshots will not interfere with this snapshot. > >>> > >> > >> For every new snapshot a new devcoredump device will be created which > >> should remain till it has been read. But now this will be created every > >> blank. IMO, this is really too much data for no reason. > > > > No-no-no. If there is a devcoredump for a device, the next one will > > not be created. See dev_coredumpm(). > > So all the snapshots will be created and then destroyed immediately. > > > > hmm ... I have certainly seen devcd_count go higher than one (but not > more than 2). I am wondering whether this was because of some race > condition of the previous destroy / new create. 
> > But anyway, this part is clear now. thanks. > > >> > >> Subsequent vblank timeouts are not going to give any new information > >> compared to the existing snapshot of the first vblank timeout thats why > >> we should just create the snapshot when the first error happens and stop. > >> > >> For other frame done timeouts, infact subsequent timeouts without any > >> sort of recovery in between are quite misleading because hardware was > >> already not able to fetch the previous frame so it will most likely not > >> fetch the next one either till it has recovered. Typically thats why > >> these vblank timeouts happen in a flurry as the hardware never really > >> recovered from the first timeout. > >> > >>> Or are you worried that snapshotting takes time, so taking a snapshot > >>> will also interfere with the vblank timings for the next vblank? > >>> > >> > >> Yes this is another point. > > > > snapshots will still be captured every vblank timeout and reading > through the entire DPU reg space every vblank timeout is certainly > something we can avoid. Ack.
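As a closing reference, the longer-term direction Abhinav mentions earlier in the thread (one central encoder error handler with a single counter that prints the error and takes the snapshot) might look roughly like the sketch below. Everything here is illustrative: the function, the error codes and the error_cnt field do not exist in the driver and only show the shape of the idea.

	/* Hypothetical central error handler for encoder timeouts, as
	 * discussed in the review.  Names and fields are illustrative. */
	enum dpu_enc_error {
		DPU_ENC_ERR_VBLANK_TIMEOUT,
		DPU_ENC_ERR_FRAME_DONE_TIMEOUT,
	};

	static void dpu_encoder_error_handler(struct dpu_encoder_phys *phys_enc,
					      enum dpu_enc_error err,
					      u32 flush_status)
	{
		/* single counter, reset in the encoder enable path */
		if (phys_enc->error_cnt++ != 0)
			return;

		DPU_ERROR("encoder error %d, CTL_FLUSH=0x%x\n", err, flush_status);
		msm_disp_snapshot_state(phys_enc->parent->dev);
	}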
diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
index d0f56c5c4cce..a8d6165b3c0a 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
@@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done(
 			(hw_ctl->ops.get_flush_register(hw_ctl) == 0),
 			msecs_to_jiffies(50));
 	if (ret <= 0) {
-		DPU_ERROR("vblank timeout\n");
+		DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl));
+		msm_disp_snapshot_state(phys_enc->parent->dev);
 		return -ETIMEDOUT;
 	}