Message ID | 20231218155927.368881-1-robdclark@gmail.com |
---|---|
State | New |
Headers |
From: Rob Clark <robdclark@gmail.com>
To: dri-devel@lists.freedesktop.org
Cc: freedreno@lists.freedesktop.org, linux-arm-msm@vger.kernel.org, Rob Clark <robdclark@chromium.org>, David Heidelberg <david.heidelberg@collabora.com>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Sean Paul <sean@poorly.run>, Marijn Suijten <marijn.suijten@somainline.org>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Konrad Dybcio <konrad.dybcio@linaro.org>, Akhil P Oommen <quic_akhilpo@quicinc.com>, Danylo Piliaiev <dpiliaiev@igalia.com>, Bjorn Andersson <andersson@kernel.org>, linux-kernel@vger.kernel.org (open list)
Subject: [PATCH] drm/msm/a6xx: Fix recovery vs runpm race
Date: Mon, 18 Dec 2023 07:59:24 -0800
Message-ID: <20231218155927.368881-1-robdclark@gmail.com>
X-Mailer: git-send-email 2.43.0
Series |
drm/msm/a6xx: Fix recovery vs runpm race
Commit Message
Rob Clark
Dec. 18, 2023, 3:59 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

a6xx_recover() relies on the gpu lock to serialize against incoming
submits doing a runpm get, as it tries to temporarily balance out the
runpm gets with puts in order to power off the GPU. Unfortunately this
gets worse when we (in a later patch) move the runpm get out of the
scheduler thread/work, in order to move it out of the fence signaling
path.

Instead we can simplify the whole thing by using force_suspend() /
force_resume() rather than trying to be clever.

Reported-by: David Heidelberg <david.heidelberg@collabora.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)
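The trade-off the commit message describes can be sketched with a small userspace toy model (this is hypothetical illustration code, not the kernel runtime-PM API): the old recovery path had to know exactly which runpm votes were outstanding and drop/retake them, while force_suspend()/force_resume()-style helpers power the device off and back on regardless of the usage count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a runtime-PM-managed device (hypothetical names). */
struct toy_dev {
	int usage_count;	/* outstanding runpm "get" votes */
	bool powered;
};

static void toy_get(struct toy_dev *d)
{
	d->usage_count++;
	d->powered = true;
}

static void toy_put(struct toy_dev *d)
{
	if (--d->usage_count == 0)
		d->powered = false;
}

/* force_suspend/resume: power off/on regardless of the usage count,
 * loosely modeling pm_runtime_force_suspend()/_resume(). */
static void toy_force_suspend(struct toy_dev *d) { d->powered = false; }
static void toy_force_resume(struct toy_dev *d)  { d->powered = true; }

/* Old-style recovery: must balance out every outstanding vote. */
static void recover_balanced(struct toy_dev *d, bool active_submits)
{
	if (active_submits)
		toy_put(d);	/* drop the ref held for active submits */
	toy_put(d);		/* and the recover worker's own ref */
	assert(!d->powered);	/* ... reset the GPU here ... */
	if (active_submits)
		toy_get(d);
	toy_get(d);
}

/* New-style recovery: no refcount bookkeeping needed at all. */
static void recover_forced(struct toy_dev *d)
{
	toy_force_suspend(d);
	assert(!d->powered);	/* ... reset the GPU here ... */
	toy_force_resume(d);
}
```

Both variants leave the device powered with its votes intact afterwards; the force-based one simply never has to reason about how many votes exist, which is the simplification the patch is after.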
Comments
On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
>
> From: Rob Clark <robdclark@chromium.org>
>
> a6xx_recover() is relying on the gpu lock to serialize against incoming
> submits doing a runpm get, as it tries to temporarily balance out the
> runpm gets with puts in order to power off the GPU. Unfortunately this
> gets worse when we (in a later patch) will move the runpm get out of the
> scheduler thread/work to move it out of the fence signaling path.
>
> Instead we can just simplify the whole thing by using force_suspend() /
> force_resume() instead of trying to be clever.

At some places, we take a pm_runtime vote and access the gpu registers
assuming it will be powered until we drop the vote. a6xx_get_timestamp()
is an example. If we do a force suspend, it may cause bus errors from
those threads. Now you have to serialize every place we do
runtime_get/put with a mutex. Or is there a better way to handle the
'later patch' you mentioned?

-Akhil.

>
> Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 268737e59131..a5660d63535b 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
>  	dev_pm_genpd_synced_poweroff(gmu->cxpd);
>
> -	/* Drop the rpm refcount from active submits */
> -	if (active_submits)
> -		pm_runtime_put(&gpu->pdev->dev);
> -
> -	/* And the final one from recover worker */
> -	pm_runtime_put_sync(&gpu->pdev->dev);
> +	pm_runtime_force_suspend(&gpu->pdev->dev);
>
>  	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
>  		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>
>  	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>
> -	if (active_submits)
> -		pm_runtime_get(&gpu->pdev->dev);
> -
> -	pm_runtime_get_sync(&gpu->pdev->dev);
> +	pm_runtime_force_resume(&gpu->pdev->dev);
>
>  	gpu->active_submits = active_submits;
>  	mutex_unlock(&gpu->active_lock);
> --
> 2.43.0
>
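The hazard Akhil raises can be sketched in a toy userspace model (hypothetical names, single-threaded for clarity; not the kernel API): a reader such as a6xx_get_timestamp() takes its own runpm vote and assumes the device stays powered for the duration, but a concurrent force-suspend cuts power regardless of the usage count, so the reader's access would fault.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dev {
	int usage_count;
	bool powered;
};

static void toy_get(struct toy_dev *d)
{
	d->usage_count++;
	d->powered = true;
}

/* Powers off regardless of any outstanding votes, like
 * pm_runtime_force_suspend() would for the real device. */
static void toy_force_suspend(struct toy_dev *d)
{
	d->powered = false;
}

/* A register read that, like a6xx_get_timestamp(), votes for power
 * first and assumes the vote keeps the device up while it reads. */
static int toy_read_reg(const struct toy_dev *d, bool *bus_error)
{
	*bus_error = !d->powered;	/* access while unpowered => bus error */
	return *bus_error ? -1 : 0;
}
```

The point of the model: the reader's vote (usage_count > 0) does nothing to stop toy_force_suspend(), which is exactly why every such reader would then need extra serialization against recovery.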
On Fri, Dec 22, 2023 at 11:58 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> >
> > From: Rob Clark <robdclark@chromium.org>
> >
> > a6xx_recover() is relying on the gpu lock to serialize against incoming
> > submits doing a runpm get, as it tries to temporarily balance out the
> > runpm gets with puts in order to power off the GPU. Unfortunately this
> > gets worse when we (in a later patch) will move the runpm get out of the
> > scheduler thread/work to move it out of the fence signaling path.
> >
> > Instead we can just simplify the whole thing by using force_suspend() /
> > force_resume() instead of trying to be clever.
>
> At some places, we take a pm_runtime vote and access the gpu
> registers assuming it will be powered until we drop the vote.
> a6xx_get_timestamp() is an example. If we do a force suspend, it may
> cause bus errors from those threads. Now you have to serialize every
> place we do runtime_get/put with a mutex. Or is there a better way to
> handle the 'later patch' you mentioned?

So I was running into issues, when I started adding an igt test to
stress test recovery vs multi-threaded submit, with cxpd not always
suspending and getting "cx gdsc did not collapse", which may be
related. I was considering using force_suspend() on the gmu and cxpd
if gpu->hang==true, I'm not sure. I ran out of time to play with this
when I was in the office.

The issue the 'later patch' is trying to deal with is getting memory
allocations out of the "fence signaling path", ie. out from the
drm/sched kthread/worker. One way to do that, without dragging all of
runpm/device-link/etc into it, is to do the runpm get in the submit
ioctl before enqueuing the job to the scheduler. But then we can hold
a lock to protect against racing with recovery.

BR,
-R

> -Akhil.
>
> >
> > Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> > Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> > Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 268737e59131..a5660d63535b 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >  	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
> >  	dev_pm_genpd_synced_poweroff(gmu->cxpd);
> >
> > -	/* Drop the rpm refcount from active submits */
> > -	if (active_submits)
> > -		pm_runtime_put(&gpu->pdev->dev);
> > -
> > -	/* And the final one from recover worker */
> > -	pm_runtime_put_sync(&gpu->pdev->dev);
> > +	pm_runtime_force_suspend(&gpu->pdev->dev);
> >
> >  	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
> >  		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> > @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >
> >  	pm_runtime_use_autosuspend(&gpu->pdev->dev);
> >
> > -	if (active_submits)
> > -		pm_runtime_get(&gpu->pdev->dev);
> > -
> > -	pm_runtime_get_sync(&gpu->pdev->dev);
> > +	pm_runtime_force_resume(&gpu->pdev->dev);
> >
> >  	gpu->active_submits = active_submits;
> >  	mutex_unlock(&gpu->active_lock);
> > --
> > 2.43.0
> >
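Rob's suggested direction — taking the runpm reference in the submit ioctl, before the job ever reaches the drm/sched worker — can be sketched with a small toy model (function and field names are illustrative, not the actual msm code): the fence-signaling path then only consumes a reference that was already taken in process context, where allocations and locks are safe.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_gpu {
	int rpm_count;		/* runpm references held on jobs' behalf */
	bool in_fence_path;	/* true while the scheduler worker runs */
};

/* Hypothetical submit path: the runpm get happens in ioctl (process)
 * context, where it may allocate or take device-link locks, and where
 * a lock can serialize it against recovery. */
static void toy_submit_ioctl(struct toy_gpu *g)
{
	g->rpm_count++;		/* runpm get, safely outside fence signaling */
	/* ... enqueue the job to the scheduler ... */
}

/* Scheduler worker: must not call into runpm at all; it only relies
 * on the reference the ioctl already took. */
static void toy_sched_run_job(struct toy_gpu *g)
{
	g->in_fence_path = true;
	assert(g->rpm_count > 0);	/* power already voted for */
	g->in_fence_path = false;
}

/* The reference is dropped when the job retires. */
static void toy_retire_job(struct toy_gpu *g)
{
	g->rpm_count--;
}
```

Under this scheme the scheduler worker never touches runpm, which is what keeps memory allocation (and the recovery race) out of the fence-signaling path.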
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 268737e59131..a5660d63535b 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
 	dev_pm_genpd_synced_poweroff(gmu->cxpd);
 
-	/* Drop the rpm refcount from active submits */
-	if (active_submits)
-		pm_runtime_put(&gpu->pdev->dev);
-
-	/* And the final one from recover worker */
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_force_suspend(&gpu->pdev->dev);
 
 	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
 		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
@@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
-	if (active_submits)
-		pm_runtime_get(&gpu->pdev->dev);
-
-	pm_runtime_get_sync(&gpu->pdev->dev);
+	pm_runtime_force_resume(&gpu->pdev->dev);
 
 	gpu->active_submits = active_submits;
 	mutex_unlock(&gpu->active_lock);