Message ID | 20221111194957.4046771-2-joel@joelfernandes.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp938991wru; Fri, 11 Nov 2022 12:02:23 -0800 (PST) X-Google-Smtp-Source: AA0mqf6ayvFgAyhi8je/+Gb0YjSEAEGLiqFYLU/SN9oPelpgvNITO9fkp/0cXubwxPR3Takon++v X-Received: by 2002:a17:906:da0c:b0:78d:b43c:81be with SMTP id fi12-20020a170906da0c00b0078db43c81bemr3169464ejb.600.1668196943542; Fri, 11 Nov 2022 12:02:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1668196943; cv=none; d=google.com; s=arc-20160816; b=0ayr31kCpp8WMBfIQ3fGgWFGw8kpAE/Z/7Luydb3LqO2y8YhfpYd410vI24btovZK5 Lz3xmfyQa/eOcsQGXotT08MVxh7sPHoXVe3iOW6ZJbIxMZaoX3nqyyizhidPEc4whn1K KiM0ySVFYH6lVBm3JFDsj1iASe+UzzNApLxFeSB6sl76JL9XPKAGp9lVTU9Igc4aCmi+ tc3DWA+pmvyr7NJjDn1uyrPsU7FOFhSl1FMnYqnkTBTUvKrAmM2S1EKewsgorCXp+u+C ZhGMTWJFJgPBC6wjn+4sC5b/RXPw+8ByFBsMfjBL1YpEmvBpMapINMm0ykTn1bHPsgxZ lTCw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=1KH+9DnybePERv04muJMWfidDb3fPimLOhb4bQHi9hY=; b=PY8UedMngeXbIstQeWzbKtQtxJLSmpsy2o0cwc6/s/phaRut+RiyCfE8SWYIwt99Ge ARueAncZWZ+j7lwk14RhHUPpvNmRis0HlcBduaKEBIb3p9uQlE7B1oGoCtUdiaiREEwX PeJe9OZK3TmHX9kFVYgxqjx7s2QzmCyFyu0ZizgXkjTxGu4m15ipvYZYMbM0LboJOrCh 4xuMXvjrTjogdh7HEo+J0MQlig5g6876EqZAwiiF6ilY/HaQ+ybwtKTeYdI+KHptWnaz i9qGP4xIFNNmndx5g4ynf9Yw+I1vuhzQKmsBpN7xJoO+I4bbqPCIeDBU9g/cH1N+M2OL /Q8g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=RHjf12jH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id hv17-20020a17090760d100b0078dfd7054absi2837686ejc.544.2022.11.11.12.01.53; Fri, 11 Nov 2022 12:02:23 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=RHjf12jH; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234241AbiKKTuV (ORCPT <rfc822;winker.wchi@gmail.com> + 99 others); Fri, 11 Nov 2022 14:50:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44720 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234063AbiKKTuS (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 11 Nov 2022 14:50:18 -0500 Received: from mail-qt1-x829.google.com (mail-qt1-x829.google.com [IPv6:2607:f8b0:4864:20::829]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F144583398 for <linux-kernel@vger.kernel.org>; Fri, 11 Nov 2022 11:50:16 -0800 (PST) Received: by mail-qt1-x829.google.com with SMTP id cg5so3236110qtb.12 for <linux-kernel@vger.kernel.org>; Fri, 11 Nov 2022 11:50:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=1KH+9DnybePERv04muJMWfidDb3fPimLOhb4bQHi9hY=; b=RHjf12jHGHZlV6LxazAvl/5fZwStrXPkGDsdKj20bcVMeSXNquWMtMx/Bsdq/o5JYm eAfmViLAt2OLTkbQbsClDh4FCyBu1w5nu3AzI7kME+tNrhg4OayMskXk5hBLP5WSLQYN vTRUqiZVfCQevVB5j1H2sbOljVncceXkIvv5Y= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=1KH+9DnybePERv04muJMWfidDb3fPimLOhb4bQHi9hY=; b=iOScaaEbt/DU2M/6/sbI126oqJ69ZsQBH3jTHxCHIIYQB6RIGGSE50mRj/cfOat4te rwM0jAhjJdOlxIQInf2mQcDjXO5BjaPL2L33gTcUgyq5Oy1M1SBtOgrQe73JNYnavKBY vxfWe+7wkvcoGpDR3k31P/yjKIoAkkSmqTUi9Z+L0jyNB41U1zwzIVQ5wgw3m+11Bam5 BPI1GFOPJMmsLKPqSnYSN+sRNmyP07ZlYChvClvxB6cLHicNh4p5D9qjp+En9H5rVvtz WMOwawWOJ4ihMe6KjZHVW9Mz15/Pz6+IEHqI3ownjkWspzKiT0mn6IT4QZiciOHjA01Y vw6Q== X-Gm-Message-State: ANoB5plwhhME5Ec5kKP86QBXqGebDNZsfpkng9EQgTlj4aXxy661mCs4 1508vdpxx0wLm8j7jXqwsKnxh601gFLcsQ== X-Received: by 2002:ac8:44d7:0:b0:39c:c7ba:4af4 with SMTP id b23-20020ac844d7000000b0039cc7ba4af4mr2862729qto.99.1668196215864; Fri, 11 Nov 2022 11:50:15 -0800 (PST) Received: from joelboxx.c.googlers.com.com (228.221.150.34.bc.googleusercontent.com. [34.150.221.228]) by smtp.gmail.com with ESMTPSA id y11-20020a05620a25cb00b006ef1a8f1b81sm1985277qko.5.2022.11.11.11.50.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 11 Nov 2022 11:50:15 -0800 (PST) From: "Joel Fernandes (Google)" <joel@joelfernandes.org> To: linux-kernel@vger.kernel.org Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>, Rob Clark <robdclark@chromium.org>, Steven Rostedt <rostedt@goodmis.org>, Ricardo Ribalda <ribalda@chromium.org>, Ross Zwisler <zwisler@kernel.org>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Akhil P Oommen <quic_akhilpo@quicinc.com>, Daniel Vetter <daniel@ffwll.ch>, David Airlie <airlied@gmail.com>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, dri-devel@lists.freedesktop.org, Emma Anholt <emma@anholt.net>, freedreno@lists.freedesktop.org, linux-arm-msm@vger.kernel.org, Rob Clark <robdclark@gmail.com>, Sean Paul <sean@poorly.run>, Vladimir Lypak <vladimir.lypak@gmail.com> Subject: [PATCH 2/2] adreno: Detect shutdown during get_param() Date: Fri, 11 Nov 2022 19:49:57 +0000 Message-Id: <20221111194957.4046771-2-joel@joelfernandes.org> X-Mailer: git-send-email 2.38.1.493.g58b659f92b-goog In-Reply-To: <20221111194957.4046771-1-joel@joelfernandes.org> References: <20221111194957.4046771-1-joel@joelfernandes.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1749231278270154642?= X-GMAIL-MSGID: =?utf-8?q?1749231278270154642?= |
Series |
[1/2] adreno: Shutdown the GPU properly
|
|
Commit Message
Joel Fernandes
Nov. 11, 2022, 7:49 p.m. UTC
Even though the GPU is shut down, during kexec reboot we can have userspace
still running. This is especially true if KEXEC_JUMP is not enabled, because we
do not freeze userspace in this case.
To prevent crashes, track that the GPU is shutdown and prevent get_param() from
accessing GPU resources if we find it shutdown.
This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
[ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[ 292.534326] Call trace:
[ 292.534328] dump_backtrace+0x0/0x1d4
[ 292.534337] show_stack+0x20/0x2c
[ 292.534342] dump_stack_lvl+0x60/0x78
[ 292.534347] dump_stack+0x18/0x38
[ 292.534352] panic+0x148/0x3b0
[ 292.534357] nmi_panic+0x80/0x94
[ 292.534364] arm64_serror_panic+0x70/0x7c
[ 292.534369] do_serror+0x0/0x7c
[ 292.534372] do_serror+0x54/0x7c
[ 292.534377] el1h_64_error_handler+0x34/0x4c
[ 292.534381] el1h_64_error+0x7c/0x80
[ 292.534386] el1_interrupt+0x20/0x58
[ 292.534389] el1h_64_irq_handler+0x18/0x24
[ 292.534395] el1h_64_irq+0x7c/0x80
[ 292.534399] local_daif_inherit+0x10/0x18
[ 292.534405] el1h_64_sync_handler+0x48/0xb4
[ 292.534410] el1h_64_sync+0x7c/0x80
[ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc
[ 292.534422] a6xx_get_timestamp+0x40/0xb4
[ 292.534426] adreno_get_param+0x12c/0x1e0
[ 292.534433] msm_ioctl_get_param+0x64/0x70
[ 292.534440] drm_ioctl_kernel+0xe8/0x158
[ 292.534448] drm_ioctl+0x208/0x320
[ 292.534453] __arm64_sys_ioctl+0x98/0xd0
[ 292.534461] invoke_syscall+0x4c/0x118
[ 292.534467] el0_svc_common+0x98/0x104
[ 292.534473] do_el0_svc+0x30/0x80
[ 292.534478] el0_svc+0x20/0x50
[ 292.534481] el0t_64_sync_handler+0x78/0x108
[ 292.534485] el0t_64_sync+0x1a4/0x1a8
[ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
[ 292.534635] PHYS_OFFSET: 0x80000000
[ 292.534638] CPU features: 0x40018541,a3300e42
[ 292.534644] Memory Limit: none
Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +-
drivers/gpu/drm/msm/msm_gpu.h | 3 +++
3 files changed, 5 insertions(+), 1 deletion(-)
Comments
On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: > Even though the GPU is shut down, during kexec reboot we can have userspace > still running. This is especially true if KEXEC_JUMP is not enabled, because we > do not freeze userspace in this case. > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from > accessing GPU resources if we find it shutdown. > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: > > [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt > [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) > [ 292.534326] Call trace: > [ 292.534328] dump_backtrace+0x0/0x1d4 > [ 292.534337] show_stack+0x20/0x2c > [ 292.534342] dump_stack_lvl+0x60/0x78 > [ 292.534347] dump_stack+0x18/0x38 > [ 292.534352] panic+0x148/0x3b0 > [ 292.534357] nmi_panic+0x80/0x94 > [ 292.534364] arm64_serror_panic+0x70/0x7c > [ 292.534369] do_serror+0x0/0x7c > [ 292.534372] do_serror+0x54/0x7c > [ 292.534377] el1h_64_error_handler+0x34/0x4c > [ 292.534381] el1h_64_error+0x7c/0x80 > [ 292.534386] el1_interrupt+0x20/0x58 > [ 292.534389] el1h_64_irq_handler+0x18/0x24 > [ 292.534395] el1h_64_irq+0x7c/0x80 > [ 292.534399] local_daif_inherit+0x10/0x18 > [ 292.534405] el1h_64_sync_handler+0x48/0xb4 > [ 292.534410] el1h_64_sync+0x7c/0x80 > [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc > [ 292.534422] a6xx_get_timestamp+0x40/0xb4 > [ 292.534426] adreno_get_param+0x12c/0x1e0 > [ 292.534433] msm_ioctl_get_param+0x64/0x70 > [ 292.534440] drm_ioctl_kernel+0xe8/0x158 > [ 292.534448] drm_ioctl+0x208/0x320 > [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 > [ 292.534461] invoke_syscall+0x4c/0x118 > [ 292.534467] el0_svc_common+0x98/0x104 > [ 292.534473] do_el0_svc+0x30/0x80 > [ 292.534478] el0_svc+0x20/0x50 > [ 292.534481] el0t_64_sync_handler+0x78/0x108 > [ 292.534485] el0t_64_sync+0x1a4/0x1a8 > [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 > [ 292.534635] PHYS_OFFSET: 0x80000000 > [ 292.534638] CPU features: 0x40018541,a3300e42 > [ 292.534644] Memory Limit: none > > Cc: Rob Clark <robdclark@chromium.org> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Ricardo Ribalda <ribalda@chromium.org> > Cc: Ross Zwisler <zwisler@kernel.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- > drivers/gpu/drm/msm/msm_gpu.h | 3 +++ > 3 files changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > index f0cff62812c3..03d912dc0130 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) > { > struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); > > + gpu->is_shutdown = true; > WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); > } > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > index 382fb7f9e497..6903c6892469 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, > struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); > > /* No pointer params yet */ > - if (*len != 0) > + if (*len != 0 || gpu->is_shutdown) > return -EINVAL; This will race with shutdown. Probably, propagating back the return value of pm_runtime_get() in every possible ioctl call path is the right thing to do. I have never thought about this scenario. Do you know why userspace is not freezed before kexec? -Akhil. > > switch (param) { > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > index ff911e7305ce..f18b0a91442b 100644 > --- a/drivers/gpu/drm/msm/msm_gpu.h > +++ b/drivers/gpu/drm/msm/msm_gpu.h > @@ -214,6 +214,9 @@ struct msm_gpu { > /* does gpu need hw_init? */ > bool needs_hw_init; > > + /* is the GPU shutdown? */ > + bool is_shutdown; > + > /** > * global_faults: number of GPU hangs not attributed to a particular > * address space
> On Nov 11, 2022, at 4:28 PM, Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: >> Even though the GPU is shut down, during kexec reboot we can have userspace >> still running. This is especially true if KEXEC_JUMP is not enabled, because we >> do not freeze userspace in this case. >> >> To prevent crashes, track that the GPU is shutdown and prevent get_param() from >> accessing GPU resources if we find it shutdown. >> >> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: >> >> [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt >> [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) >> [ 292.534326] Call trace: >> [ 292.534328] dump_backtrace+0x0/0x1d4 >> [ 292.534337] show_stack+0x20/0x2c >> [ 292.534342] dump_stack_lvl+0x60/0x78 >> [ 292.534347] dump_stack+0x18/0x38 >> [ 292.534352] panic+0x148/0x3b0 >> [ 292.534357] nmi_panic+0x80/0x94 >> [ 292.534364] arm64_serror_panic+0x70/0x7c >> [ 292.534369] do_serror+0x0/0x7c >> [ 292.534372] do_serror+0x54/0x7c >> [ 292.534377] el1h_64_error_handler+0x34/0x4c >> [ 292.534381] el1h_64_error+0x7c/0x80 >> [ 292.534386] el1_interrupt+0x20/0x58 >> [ 292.534389] el1h_64_irq_handler+0x18/0x24 >> [ 292.534395] el1h_64_irq+0x7c/0x80 >> [ 292.534399] local_daif_inherit+0x10/0x18 >> [ 292.534405] el1h_64_sync_handler+0x48/0xb4 >> [ 292.534410] el1h_64_sync+0x7c/0x80 >> [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc >> [ 292.534422] a6xx_get_timestamp+0x40/0xb4 >> [ 292.534426] adreno_get_param+0x12c/0x1e0 >> [ 292.534433] msm_ioctl_get_param+0x64/0x70 >> [ 292.534440] drm_ioctl_kernel+0xe8/0x158 >> [ 292.534448] drm_ioctl+0x208/0x320 >> [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 >> [ 292.534461] invoke_syscall+0x4c/0x118 >> [ 292.534467] el0_svc_common+0x98/0x104 >> [ 292.534473] do_el0_svc+0x30/0x80 >> [ 292.534478] el0_svc+0x20/0x50 >> [ 292.534481] el0t_64_sync_handler+0x78/0x108 >> [ 292.534485] el0t_64_sync+0x1a4/0x1a8 >> [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 >> [ 292.534635] PHYS_OFFSET: 0x80000000 >> [ 292.534638] CPU features: 0x40018541,a3300e42 >> [ 292.534644] Memory Limit: none >> >> Cc: Rob Clark <robdclark@chromium.org> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Ricardo Ribalda <ribalda@chromium.org> >> Cc: Ross Zwisler <zwisler@kernel.org> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> --- >> drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + >> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- >> drivers/gpu/drm/msm/msm_gpu.h | 3 +++ >> 3 files changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c >> index f0cff62812c3..03d912dc0130 100644 >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c >> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) >> { >> struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); >> + gpu->is_shutdown = true; >> WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); >> } >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >> index 382fb7f9e497..6903c6892469 100644 >> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c >> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, >> struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); >> /* No pointer params yet */ >> - if (*len != 0) >> + if (*len != 0 || gpu->is_shutdown) >> return -EINVAL; > This will race with shutdown. Could you clarify what you mean? At this point in the code, the shutdown is completed and it crashes here. > Probably, propagating back the return value of pm_runtime_get() in every possible ioctl call path is the right thing to do. Ok I’ll look into that. But the patch I posted works reliably and fixes all crashes we could reproduce. > I have never thought about this scenario. Do you know why userspace is not freezed before kexec? I am not sure. It depends on how kexec is used. The userspace freeze happens only when kexec is called to switch back and forth between different kernels (persistence mode). In such scenario I believe the userspace has to be frozen and unfrozen. However for normal kexec, that does not happen. Thanks. > > -Akhil. >> switch (param) { >> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h >> index ff911e7305ce..f18b0a91442b 100644 >> --- a/drivers/gpu/drm/msm/msm_gpu.h >> +++ b/drivers/gpu/drm/msm/msm_gpu.h >> @@ -214,6 +214,9 @@ struct msm_gpu { >> /* does gpu need hw_init? */ >> bool needs_hw_init; >> + /* is the GPU shutdown? */ >> + bool is_shutdown; >> + >> /** >> * global_faults: number of GPU hangs not attributed to a particular >> * address space >
On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: > > Even though the GPU is shut down, during kexec reboot we can have userspace > > still running. This is especially true if KEXEC_JUMP is not enabled, because we > > do not freeze userspace in this case. > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from > > accessing GPU resources if we find it shutdown. > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: > > > > [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt > > [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) > > [ 292.534326] Call trace: > > [ 292.534328] dump_backtrace+0x0/0x1d4 > > [ 292.534337] show_stack+0x20/0x2c > > [ 292.534342] dump_stack_lvl+0x60/0x78 > > [ 292.534347] dump_stack+0x18/0x38 > > [ 292.534352] panic+0x148/0x3b0 > > [ 292.534357] nmi_panic+0x80/0x94 > > [ 292.534364] arm64_serror_panic+0x70/0x7c > > [ 292.534369] do_serror+0x0/0x7c > > [ 292.534372] do_serror+0x54/0x7c > > [ 292.534377] el1h_64_error_handler+0x34/0x4c > > [ 292.534381] el1h_64_error+0x7c/0x80 > > [ 292.534386] el1_interrupt+0x20/0x58 > > [ 292.534389] el1h_64_irq_handler+0x18/0x24 > > [ 292.534395] el1h_64_irq+0x7c/0x80 > > [ 292.534399] local_daif_inherit+0x10/0x18 > > [ 292.534405] el1h_64_sync_handler+0x48/0xb4 > > [ 292.534410] el1h_64_sync+0x7c/0x80 > > [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc > > [ 292.534422] a6xx_get_timestamp+0x40/0xb4 > > [ 292.534426] adreno_get_param+0x12c/0x1e0 > > [ 292.534433] msm_ioctl_get_param+0x64/0x70 > > [ 292.534440] drm_ioctl_kernel+0xe8/0x158 > > [ 292.534448] drm_ioctl+0x208/0x320 > > [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 > > [ 292.534461] invoke_syscall+0x4c/0x118 > > [ 292.534467] el0_svc_common+0x98/0x104 > > [ 292.534473] do_el0_svc+0x30/0x80 > > [ 292.534478] el0_svc+0x20/0x50 > > [ 292.534481] el0t_64_sync_handler+0x78/0x108 > > [ 292.534485] el0t_64_sync+0x1a4/0x1a8 > > [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 > > [ 292.534635] PHYS_OFFSET: 0x80000000 > > [ 292.534638] CPU features: 0x40018541,a3300e42 > > [ 292.534644] Memory Limit: none > > > > Cc: Rob Clark <robdclark@chromium.org> > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Cc: Ricardo Ribalda <ribalda@chromium.org> > > Cc: Ross Zwisler <zwisler@kernel.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- > > drivers/gpu/drm/msm/msm_gpu.h | 3 +++ > > 3 files changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > > index f0cff62812c3..03d912dc0130 100644 > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) > > { > > struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); > > > > + gpu->is_shutdown = true; > > WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); > > } > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > index 382fb7f9e497..6903c6892469 100644 > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, > > struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); > > > > /* No pointer params yet */ > > - if (*len != 0) > > + if (*len != 0 || gpu->is_shutdown) > > return -EINVAL; > This will race with shutdown. Probably, propagating back the return > value of pm_runtime_get() in every possible ioctl call path is the right > thing to do. > > I have never thought about this scenario. Do you know why userspace is > not freezed before kexec? So userspace not being frozen seems like the root issue, and is likely to cause all sorts of other whack-a-mole problems. I guess I'd like to know if this is the expected behavior? If so, we should probably look at drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH that mechanism. We would need to sprinkle drm_dev_is_unplugged() checks more widely, and also ensure that the scheduler kthread(s) gets parked. But it would be nice to know first if we are just trying to paper over a kexec bug. BR, -R > > -Akhil. > > > > switch (param) { > > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > > index ff911e7305ce..f18b0a91442b 100644 > > --- a/drivers/gpu/drm/msm/msm_gpu.h > > +++ b/drivers/gpu/drm/msm/msm_gpu.h > > @@ -214,6 +214,9 @@ struct msm_gpu { > > /* does gpu need hw_init? */ > > bool needs_hw_init; > > > > + /* is the GPU shutdown? */ > > + bool is_shutdown; > > + > > /** > > * global_faults: number of GPU hangs not attributed to a particular > > * address space >
On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote: > > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: > > > Even though the GPU is shut down, during kexec reboot we can have userspace > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we > > > do not freeze userspace in this case. > > > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from > > > accessing GPU resources if we find it shutdown. > > > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: > > > > > > [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt > > > [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) > > > [ 292.534326] Call trace: > > > [ 292.534328] dump_backtrace+0x0/0x1d4 > > > [ 292.534337] show_stack+0x20/0x2c > > > [ 292.534342] dump_stack_lvl+0x60/0x78 > > > [ 292.534347] dump_stack+0x18/0x38 > > > [ 292.534352] panic+0x148/0x3b0 > > > [ 292.534357] nmi_panic+0x80/0x94 > > > [ 292.534364] arm64_serror_panic+0x70/0x7c > > > [ 292.534369] do_serror+0x0/0x7c > > > [ 292.534372] do_serror+0x54/0x7c > > > [ 292.534377] el1h_64_error_handler+0x34/0x4c > > > [ 292.534381] el1h_64_error+0x7c/0x80 > > > [ 292.534386] el1_interrupt+0x20/0x58 > > > [ 292.534389] el1h_64_irq_handler+0x18/0x24 > > > [ 292.534395] el1h_64_irq+0x7c/0x80 > > > [ 292.534399] local_daif_inherit+0x10/0x18 > > > [ 292.534405] el1h_64_sync_handler+0x48/0xb4 > > > [ 292.534410] el1h_64_sync+0x7c/0x80 > > > [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc > > > [ 292.534422] a6xx_get_timestamp+0x40/0xb4 > > > [ 292.534426] adreno_get_param+0x12c/0x1e0 > > > [ 292.534433] msm_ioctl_get_param+0x64/0x70 > > > [ 292.534440] drm_ioctl_kernel+0xe8/0x158 > > > [ 292.534448] drm_ioctl+0x208/0x320 > > > [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 > > > [ 292.534461] invoke_syscall+0x4c/0x118 > > > [ 292.534467] el0_svc_common+0x98/0x104 > > > [ 292.534473] do_el0_svc+0x30/0x80 > > > [ 292.534478] el0_svc+0x20/0x50 > > > [ 292.534481] el0t_64_sync_handler+0x78/0x108 > > > [ 292.534485] el0t_64_sync+0x1a4/0x1a8 > > > [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 > > > [ 292.534635] PHYS_OFFSET: 0x80000000 > > > [ 292.534638] CPU features: 0x40018541,a3300e42 > > > [ 292.534644] Memory Limit: none > > > > > > Cc: Rob Clark <robdclark@chromium.org> > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > Cc: Ricardo Ribalda <ribalda@chromium.org> > > > Cc: Ross Zwisler <zwisler@kernel.org> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > --- > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + > > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- > > > drivers/gpu/drm/msm/msm_gpu.h | 3 +++ > > > 3 files changed, 5 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > index f0cff62812c3..03d912dc0130 100644 > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) > > > { > > > struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); > > > > > > + gpu->is_shutdown = true; > > > WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); > > > } > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > index 382fb7f9e497..6903c6892469 100644 > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, > > > struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); > > > > > > /* No pointer params yet */ > > > - if (*len != 0) > > > + if (*len != 0 || gpu->is_shutdown) > > > return -EINVAL; > > This will race with shutdown. Probably, propagating back the return > > value of pm_runtime_get() in every possible ioctl call path is the right > > thing to do. > > > > I have never thought about this scenario. Do you know why userspace is > > not freezed before kexec? > > So userspace not being frozen seems like the root issue, and is likely > to cause all sorts of other whack-a-mole problems. I guess I'd like > to know if this is the expected behavior? We tried that. Freezing before kexec seems to cause issues for ALSA as Ricardo found: https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/ That patch is still TBD due to disagreement on the right approach to fix, so I don't think freezing before shutting down devices is viable at the moment. I am checking Ricardo if we can do something like util-linux's shutdown code which sends SIGTERM to all processes: https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273 , before issuing the kexec-reboot. Maybe, a more graceful shutdown from kexec-lite, will prevent the kexec-reboot it does from crashing? Though in my view that would still be a small copout instead of fixing the real issue, which is the kernel crashing for any reason. > If so, we should probably look at > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH > that mechanism. We would need to sprinkle drm_dev_is_unplugged() > checks more widely, and also ensure that the scheduler kthread(s) gets > parked. But it would be nice to know first if we are just trying to > paper over a kexec bug. Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned above, works or not. I will await discussions with Ricardo before reposting that one. Cheers, -- Joel
On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote: > > > > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > > > > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: > > > > Even though the GPU is shut down, during kexec reboot we can have userspace > > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we > > > > do not freeze userspace in this case. > > > > > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from > > > > accessing GPU resources if we find it shutdown. > > > > > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: > > > > > > > > [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt > > > > [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) > > > > [ 292.534326] Call trace: > > > > [ 292.534328] dump_backtrace+0x0/0x1d4 > > > > [ 292.534337] show_stack+0x20/0x2c > > > > [ 292.534342] dump_stack_lvl+0x60/0x78 > > > > [ 292.534347] dump_stack+0x18/0x38 > > > > [ 292.534352] panic+0x148/0x3b0 > > > > [ 292.534357] nmi_panic+0x80/0x94 > > > > [ 292.534364] arm64_serror_panic+0x70/0x7c > > > > [ 292.534369] do_serror+0x0/0x7c > > > > [ 292.534372] do_serror+0x54/0x7c > > > > [ 292.534377] el1h_64_error_handler+0x34/0x4c > > > > [ 292.534381] el1h_64_error+0x7c/0x80 > > > > [ 292.534386] el1_interrupt+0x20/0x58 > > > > [ 292.534389] el1h_64_irq_handler+0x18/0x24 > > > > [ 292.534395] el1h_64_irq+0x7c/0x80 > > > > [ 292.534399] local_daif_inherit+0x10/0x18 > > > > [ 292.534405] el1h_64_sync_handler+0x48/0xb4 > > > > [ 292.534410] el1h_64_sync+0x7c/0x80 > > > > [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc > > > > [ 292.534422] a6xx_get_timestamp+0x40/0xb4 > > > > [ 292.534426] adreno_get_param+0x12c/0x1e0 > > > > [ 292.534433] msm_ioctl_get_param+0x64/0x70 > > > > [ 292.534440] drm_ioctl_kernel+0xe8/0x158 > > > > [ 292.534448] drm_ioctl+0x208/0x320 > > > > [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 > > > > [ 292.534461] invoke_syscall+0x4c/0x118 > > > > [ 292.534467] el0_svc_common+0x98/0x104 > > > > [ 292.534473] do_el0_svc+0x30/0x80 > > > > [ 292.534478] el0_svc+0x20/0x50 > > > > [ 292.534481] el0t_64_sync_handler+0x78/0x108 > > > > [ 292.534485] el0t_64_sync+0x1a4/0x1a8 > > > > [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 > > > > [ 292.534635] PHYS_OFFSET: 0x80000000 > > > > [ 292.534638] CPU features: 0x40018541,a3300e42 > > > > [ 292.534644] Memory Limit: none > > > > > > > > Cc: Rob Clark <robdclark@chromium.org> > > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org> > > > > Cc: Ross Zwisler <zwisler@kernel.org> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > --- > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + > > > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- > > > > drivers/gpu/drm/msm/msm_gpu.h | 3 +++ > > > > 3 files changed, 5 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > index f0cff62812c3..03d912dc0130 100644 > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) > > > > { > > > > struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); > > > > > > > > + gpu->is_shutdown = true; > > > > WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); > > > > } > > > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > index 382fb7f9e497..6903c6892469 100644 > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, > > > > struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); > > > > > > > > /* No pointer params yet */ > > > > - if (*len != 0) > > > > + if (*len != 0 || gpu->is_shutdown) > > > > return -EINVAL; > > > This will race with shutdown. Probably, propagating back the return > > > value of pm_runtime_get() in every possible ioctl call path is the right > > > thing to do. > > > > > > I have never thought about this scenario. Do you know why userspace is > > > not freezed before kexec? > > > > So userspace not being frozen seems like the root issue, and is likely > > to cause all sorts of other whack-a-mole problems. I guess I'd like > > to know if this is the expected behavior? > > We tried that. Freezing before kexec seems to cause issues for ALSA as > Ricardo found: > https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/ > > That patch is still TBD due to disagreement on the right approach to > fix, so I don't think freezing before shutting down devices is viable > at the moment. > > I am checking Ricardo if we can do something like util-linux's > shutdown code which sends SIGTERM to all processes: > https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273 > , before issuing the kexec-reboot. > > Maybe, a more graceful shutdown from kexec-lite, will prevent the > kexec-reboot it does from crashing? Though in my view that would still > be a small copout instead of fixing the real issue, which is the > kernel crashing for any reason. The problem is that pm_runtime_force_suspend() yanks the rug out from under the driver from a runpm PoV.. generally this is ok (as long as scheduler kthread is parked) because we don't have to worry about userspace ;-) > > If so, we should probably look at > > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH > > that mechanism. We would need to sprinkle drm_dev_is_unplugged() > > checks more widely, and also ensure that the scheduler kthread(s) gets > > parked. But it would be nice to know first if we are just trying to > > paper over a kexec bug. > > Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned > above, works or not. I will await discussions with Ricardo before > reposting that one. Yeah, I think I'm waiting on a v2 of that one ;-) BR, -R > Cheers, > > -- Joel
On Thu, Dec 1, 2022 at 7:33 PM Rob Clark <robdclark@gmail.com> wrote: > > On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote: > > > > > > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > > > > > > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote: > > > > > Even though the GPU is shut down, during kexec reboot we can have userspace > > > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we > > > > > do not freeze userspace in this case. > > > > > > > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from > > > > > accessing GPU resources if we find it shutdown. > > > > > > > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU: > > > > > > > > > > [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt > > > > > [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT) > > > > > [ 292.534326] Call trace: > > > > > [ 292.534328] dump_backtrace+0x0/0x1d4 > > > > > [ 292.534337] show_stack+0x20/0x2c > > > > > [ 292.534342] dump_stack_lvl+0x60/0x78 > > > > > [ 292.534347] dump_stack+0x18/0x38 > > > > > [ 292.534352] panic+0x148/0x3b0 > > > > > [ 292.534357] nmi_panic+0x80/0x94 > > > > > [ 292.534364] arm64_serror_panic+0x70/0x7c > > > > > [ 292.534369] do_serror+0x0/0x7c > > > > > [ 292.534372] do_serror+0x54/0x7c > > > > > [ 292.534377] el1h_64_error_handler+0x34/0x4c > > > > > [ 292.534381] el1h_64_error+0x7c/0x80 > > > > > [ 292.534386] el1_interrupt+0x20/0x58 > > > > > [ 292.534389] el1h_64_irq_handler+0x18/0x24 > > > > > [ 292.534395] el1h_64_irq+0x7c/0x80 > > > > > [ 292.534399] local_daif_inherit+0x10/0x18 > > > > > [ 292.534405] el1h_64_sync_handler+0x48/0xb4 > > > > > [ 292.534410] el1h_64_sync+0x7c/0x80 > > > > > [ 292.534414] a6xx_gmu_set_oob+0xbc/0x1fc > > > > > [ 292.534422] a6xx_get_timestamp+0x40/0xb4 > > > > > [ 292.534426] adreno_get_param+0x12c/0x1e0 > > > > > [ 292.534433] msm_ioctl_get_param+0x64/0x70 > > > > > [ 292.534440] drm_ioctl_kernel+0xe8/0x158 > > > > > [ 292.534448] drm_ioctl+0x208/0x320 > > > > > [ 292.534453] __arm64_sys_ioctl+0x98/0xd0 > > > > > [ 292.534461] invoke_syscall+0x4c/0x118 > > > > > [ 292.534467] el0_svc_common+0x98/0x104 > > > > > [ 292.534473] do_el0_svc+0x30/0x80 > > > > > [ 292.534478] el0_svc+0x20/0x50 > > > > > [ 292.534481] el0t_64_sync_handler+0x78/0x108 > > > > > [ 292.534485] el0t_64_sync+0x1a4/0x1a8 > > > > > [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000 > > > > > [ 292.534635] PHYS_OFFSET: 0x80000000 > > > > > [ 292.534638] CPU features: 0x40018541,a3300e42 > > > > > [ 292.534644] Memory Limit: none > > > > > > > > > > Cc: Rob Clark <robdclark@chromium.org> > > > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > > > Cc: Ricardo Ribalda <ribalda@chromium.org> > > > > > Cc: Ross Zwisler <zwisler@kernel.org> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > --- > > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 1 + > > > > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- > > > > > drivers/gpu/drm/msm/msm_gpu.h | 3 +++ > > > > > 3 files changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > > index f0cff62812c3..03d912dc0130 100644 > > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > > > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) > > > > > { > > > > > struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); > > > > > > > > > > + gpu->is_shutdown = true; > > > > > WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); > > > > > } > > > > > > > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > > index 382fb7f9e497..6903c6892469 100644 > > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, > > > > > struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); > > > > > > > > > > /* No pointer params yet */ > > > > > - if (*len != 0) > > > > > + if (*len != 0 || gpu->is_shutdown) > > > > > return -EINVAL; > > > > This will race with shutdown. Probably, propagating back the return > > > > value of pm_runtime_get() in every possible ioctl call path is the right > > > > thing to do. > > > > > > > > I have never thought about this scenario. Do you know why userspace is > > > > not freezed before kexec? > > > > > > So userspace not being frozen seems like the root issue, and is likely > > > to cause all sorts of other whack-a-mole problems. I guess I'd like > > > to know if this is the expected behavior? > > > > We tried that. Freezing before kexec seems to cause issues for ALSA as > > Ricardo found: > > https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/ > > > > That patch is still TBD due to disagreement on the right approach to > > fix, so I don't think freezing before shutting down devices is viable > > at the moment. > > > > I am checking Ricardo if we can do something like util-linux's > > shutdown code which sends SIGTERM to all processes: > > https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273 > > , before issuing the kexec-reboot. > > > > Maybe, a more graceful shutdown from kexec-lite, will prevent the > > kexec-reboot it does from crashing? Though in my view that would still > > be a small copout instead of fixing the real issue, which is the > > kernel crashing for any reason. > > The problem is that pm_runtime_force_suspend() yanks the rug out from > under the driver from a runpm PoV.. generally this is ok (as long as > scheduler kthread is parked) because we don't have to worry about > userspace ;-) > > > > If so, we should probably look at > > > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH > > > that mechanism. We would need to sprinkle drm_dev_is_unplugged() > > > checks more widely, and also ensure that the scheduler kthread(s) gets > > > parked. But it would be nice to know first if we are just trying to > > > paper over a kexec bug. > > > > Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned > > above, works or not. I will await discussions with Ricardo before > > reposting that one. > > Yeah, I think I'm waiting on a v2 of that one ;-) > Cool, I will send you a v2 on that shortly then, I think it is good to do the patch 1/2 change anyway. :-) Thanks, - Joel > > Cheers, > > > > -- Joel
diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index f0cff62812c3..03d912dc0130 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev) { struct msm_gpu *gpu = dev_to_gpu(&pdev->dev); + gpu->is_shutdown = true; WARN_ON_ONCE(adreno_system_suspend(&pdev->dev)); } diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 382fb7f9e497..6903c6892469 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx, struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); /* No pointer params yet */ - if (*len != 0) + if (*len != 0 || gpu->is_shutdown) return -EINVAL; switch (param) { diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index ff911e7305ce..f18b0a91442b 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -214,6 +214,9 @@ struct msm_gpu { /* does gpu need hw_init? */ bool needs_hw_init; + /* is the GPU shutdown? */ + bool is_shutdown; + /** * global_faults: number of GPU hangs not attributed to a particular * address space