Message ID | 20231102141507.73481-1-angelogioacchino.delregno@collabora.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:8f47:0:b0:403:3b70:6f57 with SMTP id j7csp385635vqu; Thu, 2 Nov 2023 07:16:23 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEkvDsANOnu1uQPvswyO72OHR/nw6YY9dGElX7PnlNbPP9kKwEtHIJBMX04mnmRtjDMue/B X-Received: by 2002:a05:6a00:10d2:b0:693:3963:847a with SMTP id d18-20020a056a0010d200b006933963847amr18197370pfu.30.1698934583450; Thu, 02 Nov 2023 07:16:23 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1698934583; cv=none; d=google.com; s=arc-20160816; b=YmJSJeKxDDHaqu8aGIiR6hOPJFoVjr3uAqCsOpN1kVEuQozQ1pBmyS0cNrIgE5EReE u7YXmpX4gWvFIesrBEoApPd6SYU7eJfKXbcc20dIaf/Ou3wKFtkqdyrNA/8Arrm6FpIB G+Auq3Pg1WV6i+14NuPrJEsa6Zu/Eze5hZydZ70755d9Zye7/Kee+wiDCARt/VybyWhJ PuAHkgSWIyKkyD//u6kvDK8HjQMTwIuvQ5cubcQZVzURTYWyVFrHQftbJj/8CADoQKTG EOfWkPN5/Bl+CqH49ZV4FvD1FgO5sb1vXPAkstZmZxoEgKeYi7yD02285sZkODlrDzNr Sk+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=ypyD1qST4ZPJ2fUQFwzCaZv8z+KQxn38nFMeWF0Vn+E=; fh=tAWvjMAEmVD3R8FzmIRYzQIb+yW9J+gWZJ2R8deHkVw=; b=H/VTyx5Qy/4bs6PpfSEAFN7cnggbyunyLGhQqvhoAyMtEGQiX67NANbF8zDBIe5SV4 49jo2hQeOQVCSRsuS0kkranpPYQu+WKxbQb03qBtV0NJ9i/Rpt5d9breJYqEiKUU+vrq sXo25lQR3j9bLthAZdxfQ4plMWMlp2cSLJZNbDV+Ms3IJ1RxQBfFin9DYethBVjDaLau kuus0a05ckUOhqJcJzxaItphW52wgw/AdIkvAFjJs38OJcRbnfrl2y7W1knsvhlz4GZT 2CmB1KkPPWDoJdRTay0O0ZRJegqK6dFC3kvTN2a2Tk9oNezIRjG+C1hTgrWPBJU3xoEg s7FQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=HqpIaYF1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Received: from fry.vger.email (fry.vger.email. [2620:137:e000::3:8]) by mx.google.com with ESMTPS id y12-20020a056a00180c00b006c0fe926fc9si3819414pfa.173.2023.11.02.07.16.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 02 Nov 2023 07:16:23 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) client-ip=2620:137:e000::3:8; Authentication-Results: mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=HqpIaYF1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by fry.vger.email (Postfix) with ESMTP id 2C4508096791; Thu, 2 Nov 2023 07:15:51 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at fry.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376715AbjKBOPZ (ORCPT <rfc822;lhua1029@gmail.com> + 36 others); Thu, 2 Nov 2023 10:15:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44722 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230024AbjKBOPY (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 2 Nov 2023 10:15:24 -0400 Received: from madras.collabora.co.uk (madras.collabora.co.uk [IPv6:2a00:1098:0:82:1000:25:2eeb:e5ab]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1450A12E for <linux-kernel@vger.kernel.org>; Thu, 2 Nov 2023 07:15:15 -0700 (PDT) Received: from IcarusMOD.eternityproject.eu (cola.collaboradmins.com [195.201.22.229]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: kholk11) by madras.collabora.co.uk (Postfix) with ESMTPSA id 203C06607033; Thu, 2 Nov 2023 14:15:13 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1698934514; bh=cAv8iDJbPkwyr97yslzoCR0E+J9ioopOVLD7tn0xeaE=; h=From:To:Cc:Subject:Date:From; b=HqpIaYF1rO3cGnHd2r+erpo4hV2GzGpXpvcUY+S6BvEdi0pqjEu5FWy6QdvWTB3Gn sLX5M2x5Eba62EWJQeIhhroPsAOPWMocc59q2AuWBWBlax4+QBZzkgfoOdyFqPQbXF Y7hMqhsMlQszW/jpI8+JDDIRReCRzoRS38k39ZND8OCmYQjwIAov7wyhLEOh3B4a8K B14ZdeSjkbBKNhWSqF/sa/1vcE8KOscGZlWKXVKeDxxVrAu7PA0NcjV6DjsU7+mLU0 /B6FVDPNyCX1zZQRdHj1Wv1SpIjqjXob6NB925dUEtepvVxB63XjzTVkIiW32H160F OWsLpYE84uuzA== From: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> To: boris.brezillon@collabora.com Cc: robh@kernel.org, steven.price@arm.com, maarten.lankhorst@linux.intel.com, mripard@kernel.org, tzimmermann@suse.de, airlied@gmail.com, daniel@ffwll.ch, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, wenst@chromium.org, AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>, kernel@collabora.com Subject: [PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() Date: Thu, 2 Nov 2023 15:15:07 +0100 Message-ID: <20231102141507.73481-1-angelogioacchino.delregno@collabora.com> X-Mailer: git-send-email 2.42.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]); Thu, 02 Nov 2023 07:15:51 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1781462030127409950 X-GMAIL-MSGID: 1781462030127409950 |
Series |
drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()
|
|
Commit Message
AngeloGioacchino Del Regno
Nov. 2, 2023, 2:15 p.m. UTC
The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
this means that in order to request poweroff of cores, we are supposed
to write a bitmask of cores that should be powered off!
This means that the panfrost_gpu_power_off() function has always been
doing nothing.
Fix powering off the GPU by writing a bitmask of the cores to poweroff
to the relevant PWROFF_LO registers and then check that the transition
(from ON to OFF) has finished by polling the relevant PWRTRANS_LO
registers.
While at it, in order to avoid code duplication, move the core mask
logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
function, used in both poweron and poweroff.
Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
---
drivers/gpu/drm/panfrost/panfrost_gpu.c | 65 ++++++++++++++++++-------
1 file changed, 47 insertions(+), 18 deletions(-)
Comments
On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: > The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request > powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: > this means that in order to request poweroff of cores, we are supposed > to write a bitmask of cores that should be powered off! > This means that the panfrost_gpu_power_off() function has always been > doing nothing. > > Fix powering off the GPU by writing a bitmask of the cores to poweroff > to the relevant PWROFF_LO registers and then check that the transition > (from ON to OFF) has finished by polling the relevant PWRTRANS_LO > registers. > > While at it, in order to avoid code duplication, move the core mask > logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() > function, used in both poweron and poweroff. > > Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") > Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Thanks, Steve
On 08/11/2023 14:20, Steven Price wrote: > On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >> this means that in order to request poweroff of cores, we are supposed >> to write a bitmask of cores that should be powered off! >> This means that the panfrost_gpu_power_off() function has always been >> doing nothing. >> >> Fix powering off the GPU by writing a bitmask of the cores to poweroff >> to the relevant PWROFF_LO registers and then check that the transition >> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >> registers. >> >> While at it, in order to avoid code duplication, move the core mask >> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >> function, used in both poweron and poweroff. >> >> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Hi, This commit was added to next recently but it causes "external abort on non-linefetch" during boot of my Odroid HC1 board. At least bisect points to it. If fixed, please add: Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> [ 4.861683] 8<--- cut here --- [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 ... [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 Full log: https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 1. exynos_defconfig 2. HW: Odroid HC1 ARMv7, octa-core (Cortex-A7+A15), Exynos5422 SoC arm,mali-t628 Bisect log: git bisect start # bad: [07b677953b9dca02928be323e2db853511305fa9] Add linux-next specific files for 20231121 git bisect bad 07b677953b9dca02928be323e2db853511305fa9 # good: [98b1cc82c4affc16f5598d4fa14b1858671b2263] Linux 6.7-rc2 git bisect good 98b1cc82c4affc16f5598d4fa14b1858671b2263 # good: [13e2401d5bdc7f5a30f2651c99f0e3374cdda815] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git git bisect good 13e2401d5bdc7f5a30f2651c99f0e3374cdda815 # bad: [3b586cd6d8e51c428675312e7c3f634eb96337e9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git git bisect bad 3b586cd6d8e51c428675312e7c3f634eb96337e9 # bad: [9d63fd5f05248c78d9a66ce5dbc9cf5649054848] Merge branch 'drm-next' of https://gitlab.freedesktop.org/agd5f/linux git bisect bad 9d63fd5f05248c78d9a66ce5dbc9cf5649054848 # bad: [5dea0c3fedee65413271a5700e653eff633e9a7f] drm/panel-elida-kd35t133: Drop shutdown logic git bisect bad 5dea0c3fedee65413271a5700e653eff633e9a7f # good: [48d45fac3940347becd290b96b2fc6d5ad8171f7] accel/ivpu: Remove support for uncached buffers git bisect good 48d45fac3940347becd290b96b2fc6d5ad8171f7 # bad: [809ef191ee600e8bcbe2f8a769e00d2d54c16094] drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm git bisect bad 809ef191ee600e8bcbe2f8a769e00d2d54c16094 # good: [a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f] drm/sched: implement dynamic job-flow control git bisect good a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f # bad: [e4178256094a76cc36d9b9aabe7482615959b26f] drm/virtio: use uint64_t more in virtio_gpu_context_init_ioctl git bisect bad e4178256094a76cc36d9b9aabe7482615959b26f # bad: [56e76c0179185568049913257c18069293f8bde9] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend git bisect bad 56e76c0179185568049913257c18069293f8bde9 # bad: [57d4e26717b030fd794df3534e6b2e806eb761e4] drm/panfrost: Perform hard reset to recover GPU if soft reset fails git bisect bad 57d4e26717b030fd794df3534e6b2e806eb761e4 # bad: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() git bisect bad 22aa1a209018dc2eca78745f7666db63637cd5dc # first bad commit: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() Best regards, Krzysztof
Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: > On 08/11/2023 14:20, Steven Price wrote: >> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>> this means that in order to request poweroff of cores, we are supposed >>> to write a bitmask of cores that should be powered off! >>> This means that the panfrost_gpu_power_off() function has always been >>> doing nothing. >>> >>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>> to the relevant PWROFF_LO registers and then check that the transition >>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>> registers. >>> >>> While at it, in order to avoid code duplication, move the core mask >>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>> function, used in both poweron and poweroff. >>> >>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > > > Hi, > > This commit was added to next recently but it causes "external abort on > non-linefetch" during boot of my Odroid HC1 board. > > At least bisect points to it. > > If fixed, please add: > > Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > > [ 4.861683] 8<--- cut here --- > [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c > [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 > ... > [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c > [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 > [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 > [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 > [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 > [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 > [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 > > Full log: > https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 > Hey Krzysztof, This is interesting. It might be about the cores that are missing from the partial core_mask raising interrupts, but an external abort on non-linefetch is strange to see here. Would you be available for some tests? I'm thinking to call power_off on all cores (all shaders, all tilers, all l2s), regardless of what panfrost_get_core_mask() says, as it could be that your GPU powers on the cores that are unused by Panfrost by default, and that then we never turn them off, escalating to this issue. If you can please please please test: void panfrost_gpu_power_off(struct panfrost_device *pfdev) { u64 core_mask = panfrost_get_core_mask(pfdev); int ret; u32 val; gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); gpu_write(pfdev, SHADER_PWROFF_HI, pfdev->features.shader_present >> 32); ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, val, !val, 1, 1000); if (ret) dev_err(pfdev->dev, "shader power transition timeout"); gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); gpu_write(pfdev, TILER_PWROFF_HI, pfdev->features.tiler_present >> 32); ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, val, !val, 1, 1000); if (ret) dev_err(pfdev->dev, "tiler power transition timeout"); gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, val, !val, 0, 1000); if (ret) dev_err(pfdev->dev, "l2 power transition timeout"); gpu_write(pfdev, L2_PWROFF_HI, pfdev->features.l2_present >> 32); ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, val, !val, 0, 1000); if (ret) dev_err(pfdev->dev, "l2 power transition timeout"); } This is a quick hack that might work. If this does actually work, the real fix will be to PWROFF the extra cores only once, at panfrost_gpu_init() time, before calling panfrost_gpu_power_on(), and to leave them disabled forever (until Panfrost actually gets to support those extra cores for real). Thanks, Angelo > 1. exynos_defconfig > 2. HW: Odroid HC1 > ARMv7, octa-core (Cortex-A7+A15), Exynos5422 SoC > arm,mali-t628 > > Bisect log: > > git bisect start > # bad: [07b677953b9dca02928be323e2db853511305fa9] Add linux-next specific files for 20231121 > git bisect bad 07b677953b9dca02928be323e2db853511305fa9 > # good: [98b1cc82c4affc16f5598d4fa14b1858671b2263] Linux 6.7-rc2 > git bisect good 98b1cc82c4affc16f5598d4fa14b1858671b2263 > # good: [13e2401d5bdc7f5a30f2651c99f0e3374cdda815] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git > git bisect good 13e2401d5bdc7f5a30f2651c99f0e3374cdda815 > # bad: [3b586cd6d8e51c428675312e7c3f634eb96337e9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git > git bisect bad 3b586cd6d8e51c428675312e7c3f634eb96337e9 > # bad: [9d63fd5f05248c78d9a66ce5dbc9cf5649054848] Merge branch 'drm-next' of https://gitlab.freedesktop.org/agd5f/linux > git bisect bad 9d63fd5f05248c78d9a66ce5dbc9cf5649054848 > # bad: [5dea0c3fedee65413271a5700e653eff633e9a7f] drm/panel-elida-kd35t133: Drop shutdown logic > git bisect bad 5dea0c3fedee65413271a5700e653eff633e9a7f > # good: [48d45fac3940347becd290b96b2fc6d5ad8171f7] accel/ivpu: Remove support for uncached buffers > git bisect good 48d45fac3940347becd290b96b2fc6d5ad8171f7 > # bad: [809ef191ee600e8bcbe2f8a769e00d2d54c16094] drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm > git bisect bad 809ef191ee600e8bcbe2f8a769e00d2d54c16094 > # good: [a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f] drm/sched: implement dynamic job-flow control > git bisect good a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f > # bad: [e4178256094a76cc36d9b9aabe7482615959b26f] drm/virtio: use uint64_t more in virtio_gpu_context_init_ioctl > git bisect bad e4178256094a76cc36d9b9aabe7482615959b26f > # bad: [56e76c0179185568049913257c18069293f8bde9] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend > git bisect bad 56e76c0179185568049913257c18069293f8bde9 > # bad: [57d4e26717b030fd794df3534e6b2e806eb761e4] drm/panfrost: Perform hard reset to recover GPU if soft reset fails > git bisect bad 57d4e26717b030fd794df3534e6b2e806eb761e4 > # bad: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() > git bisect bad 22aa1a209018dc2eca78745f7666db63637cd5dc > # first bad commit: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() > > > Best regards, > Krzysztof >
On 21/11/2023 17:11, AngeloGioacchino Del Regno wrote: > Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >> On 08/11/2023 14:20, Steven Price wrote: >>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>>> this means that in order to request poweroff of cores, we are supposed >>>> to write a bitmask of cores that should be powered off! >>>> This means that the panfrost_gpu_power_off() function has always been >>>> doing nothing. >>>> >>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>>> to the relevant PWROFF_LO registers and then check that the transition >>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>> registers. >>>> >>>> While at it, in order to avoid code duplication, move the core mask >>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>> function, used in both poweron and poweroff. >>>> >>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >> >> >> Hi, >> >> This commit was added to next recently but it causes "external abort on >> non-linefetch" during boot of my Odroid HC1 board. >> >> At least bisect points to it. >> >> If fixed, please add: >> >> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >> >> [ 4.861683] 8<--- cut here --- >> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c >> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >> ... >> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c >> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 >> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 >> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 >> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >> >> Full log: >> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >> > > Hey Krzysztof, > > This is interesting. It might be about the cores that are missing from the partial > core_mask raising interrupts, but an external abort on non-linefetch is strange to > see here. > > Would you be available for some tests? > > I'm thinking to call power_off on all cores (all shaders, all tilers, all l2s), > regardless of what panfrost_get_core_mask() says, as it could be that your GPU > powers on the cores that are unused by Panfrost by default, and that then we never > turn them off, escalating to this issue. > > If you can please please please test: > > void panfrost_gpu_power_off(struct panfrost_device *pfdev) > { > u64 core_mask = panfrost_get_core_mask(pfdev); > int ret; > u32 val; > > gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); > gpu_write(pfdev, SHADER_PWROFF_HI, pfdev->features.shader_present >> 32); > ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "shader power transition timeout"); > > gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); > gpu_write(pfdev, TILER_PWROFF_HI, pfdev->features.tiler_present >> 32); > ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "tiler power transition timeout"); > > gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); > ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, > val, !val, 0, 1000); > if (ret) > dev_err(pfdev->dev, "l2 power transition timeout"); > > gpu_write(pfdev, L2_PWROFF_HI, pfdev->features.l2_present >> 32); > ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, > val, !val, 0, 1000); > if (ret) > dev_err(pfdev->dev, "l2 power transition timeout"); > } > Send a diff please - against next or some other commit sha from next. Best regards, Krzysztof
On Tue, 21 Nov 2023 17:11:42 +0100 AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> wrote: > Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: > > On 08/11/2023 14:20, Steven Price wrote: > >> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: > >>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request > >>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: > >>> this means that in order to request poweroff of cores, we are supposed > >>> to write a bitmask of cores that should be powered off! > >>> This means that the panfrost_gpu_power_off() function has always been > >>> doing nothing. > >>> > >>> Fix powering off the GPU by writing a bitmask of the cores to poweroff > >>> to the relevant PWROFF_LO registers and then check that the transition > >>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO > >>> registers. > >>> > >>> While at it, in order to avoid code duplication, move the core mask > >>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() > >>> function, used in both poweron and poweroff. > >>> > >>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") > >>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > > > > > > Hi, > > > > This commit was added to next recently but it causes "external abort on > > non-linefetch" during boot of my Odroid HC1 board. > > > > At least bisect points to it. > > > > If fixed, please add: > > > > Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > > > > [ 4.861683] 8<--- cut here --- > > [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c > > [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 > > ... > > [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c > > [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 > > [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 > > [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 > > [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 > > [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 > > [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 > > > > Full log: > > https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 > > > > Hey Krzysztof, > > This is interesting. It might be about the cores that are missing from the partial > core_mask raising interrupts, but an external abort on non-linefetch is strange to > see here. I've seen such external aborts in the past, and the fault type has often been misleading. It's unlikely to have anything to do with a non-linefetch access, but it might be caused by a register access after the clock or power domain driving the register bank has been disabled. The following diff might help validate this theory. If that works, we probably want to make sure we synchronize IRQs before disabling in the suspend path. --->8--- diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h index 55ec807550b3..98df66e5cc9b 100644 --- a/drivers/gpu/drm/panfrost/panfrost_regs.h +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h @@ -34,8 +34,6 @@ (GPU_IRQ_FAULT |\ GPU_IRQ_MULTIPLE_FAULT |\ GPU_IRQ_RESET_COMPLETED |\ - GPU_IRQ_POWER_CHANGED |\ - GPU_IRQ_POWER_CHANGED_ALL |\ GPU_IRQ_PERFCNT_SAMPLE_COMPLETED |\ GPU_IRQ_CLEAN_CACHES_COMPLETED) #define GPU_IRQ_MASK_ERROR \
On 21/11/2023 17:55, Boris Brezillon wrote: > On Tue, 21 Nov 2023 17:11:42 +0100 > AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > wrote: > >> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>> On 08/11/2023 14:20, Steven Price wrote: >>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>>>> this means that in order to request poweroff of cores, we are supposed >>>>> to write a bitmask of cores that should be powered off! >>>>> This means that the panfrost_gpu_power_off() function has always been >>>>> doing nothing. >>>>> >>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>>>> to the relevant PWROFF_LO registers and then check that the transition >>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>> registers. >>>>> >>>>> While at it, in order to avoid code duplication, move the core mask >>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>> function, used in both poweron and poweroff. >>>>> >>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>> >>> >>> Hi, >>> >>> This commit was added to next recently but it causes "external abort on >>> non-linefetch" during boot of my Odroid HC1 board. >>> >>> At least bisect points to it. >>> >>> If fixed, please add: >>> >>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>> >>> [ 4.861683] 8<--- cut here --- >>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c >>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>> ... >>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c >>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 >>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 >>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 >>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>> >>> Full log: >>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>> >> >> Hey Krzysztof, >> >> This is interesting. It might be about the cores that are missing from the partial >> core_mask raising interrupts, but an external abort on non-linefetch is strange to >> see here. > > I've seen such external aborts in the past, and the fault type has > often been misleading. It's unlikely to have anything to do with a Yeah, often accessing device with power or clocks gated. > non-linefetch access, but it might be caused by a register access after > the clock or power domain driving the register bank has been disabled. > The following diff might help validate this theory. If that works, we > probably want to make sure we synchronize IRQs before disabling in the > suspend path. > > --->8--- > diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h > index 55ec807550b3..98df66e5cc9b 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_regs.h > +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h > @@ -34,8 +34,6 @@ > (GPU_IRQ_FAULT |\ > GPU_IRQ_MULTIPLE_FAULT |\ > GPU_IRQ_RESET_COMPLETED |\ > - GPU_IRQ_POWER_CHANGED |\ > - GPU_IRQ_POWER_CHANGED_ALL |\ This helped, at least for this issue (next-20231121). Much later in user-space boot I have lockups: watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61] [ 56.329224] smp_call_function_single from __sync_rcu_exp_select_node_cpus+0x29c/0x78c [ 56.337111] __sync_rcu_exp_select_node_cpus from sync_rcu_exp_select_cpus+0x334/0x878 [ 56.344995] sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18 [ 56.351231] wait_rcu_exp_gp from process_one_work+0x20c/0x620 [ 56.357038] process_one_work from worker_thread+0x1d0/0x488 [ 56.362668] worker_thread from kthread+0x104/0x138 [ 56.367521] kthread from ret_from_fork+0x14/0x28 But anyway the external abort does not appear. Best regards, Krzysztof
On Tue, 21 Nov 2023 18:08:44 +0100 Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> wrote: > > non-linefetch access, but it might be caused by a register access after > > the clock or power domain driving the register bank has been disabled. > > The following diff might help validate this theory. If that works, we > > probably want to make sure we synchronize IRQs before disabling in the > > suspend path. > > > > --->8--- > > diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h > > index 55ec807550b3..98df66e5cc9b 100644 > > --- a/drivers/gpu/drm/panfrost/panfrost_regs.h > > +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h > > @@ -34,8 +34,6 @@ > > (GPU_IRQ_FAULT |\ > > GPU_IRQ_MULTIPLE_FAULT |\ > > GPU_IRQ_RESET_COMPLETED |\ > > - GPU_IRQ_POWER_CHANGED |\ > > - GPU_IRQ_POWER_CHANGED_ALL |\ > > This helped, at least for this issue (next-20231121). Much later in > user-space boot I have lockups: > watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61] Hm, if this doesn't happen with "drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()" reverted, it might be related to the issue Angelo was describing, though I'd expect it to lead to job timeouts, not the sort of soft lockup reported here. > > [ 56.329224] smp_call_function_single from > __sync_rcu_exp_select_node_cpus+0x29c/0x78c > [ 56.337111] __sync_rcu_exp_select_node_cpus from > sync_rcu_exp_select_cpus+0x334/0x878 > [ 56.344995] sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18 > [ 56.351231] wait_rcu_exp_gp from process_one_work+0x20c/0x620 > [ 56.357038] process_one_work from worker_thread+0x1d0/0x488 > [ 56.362668] worker_thread from kthread+0x104/0x138 > [ 56.367521] kthread from ret_from_fork+0x14/0x28 > > But anyway the external abort does not appear. Thanks for testing! As I said in my previous reply, I think the proper fix for this particular issue would be to mask all panfrost IRQs (writing 0 to xxx_INT_MASK) + call synchronize_irq() from panfrost_device_suspend(), to make sure pending interrupts are flushed and the handlers can't be called again (or at least not with real interrupts to process, if the IRQ line is shared) until the device is resumed. FWIW, after fighting with annoying bugs in the interrupt handling logic and its interactions with runtime PM, I've decided to automate some of this in panthor [1]. Also made the power on/off logic generic [2], with macros to allow powering on/off specific blocks. Not saying we should do that in panfrost, just posting it as a reference, in case someone is interested. [1]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_device.h?ref_type=heads#L279 [2]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_gpu.c?ref_type=heads#L279 [3]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_gpu.h?ref_type=heads#L24
Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: > On 21/11/2023 17:55, Boris Brezillon wrote: >> On Tue, 21 Nov 2023 17:11:42 +0100 >> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >> wrote: >> >>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>>> On 08/11/2023 14:20, Steven Price wrote: >>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>>>>> this means that in order to request poweroff of cores, we are supposed >>>>>> to write a bitmask of cores that should be powered off! >>>>>> This means that the panfrost_gpu_power_off() function has always been >>>>>> doing nothing. >>>>>> >>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>>>>> to the relevant PWROFF_LO registers and then check that the transition >>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>>> registers. >>>>>> >>>>>> While at it, in order to avoid code duplication, move the core mask >>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>>> function, used in both poweron and poweroff. >>>>>> >>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>>> >>>> >>>> Hi, >>>> >>>> This commit was added to next recently but it causes "external abort on >>>> non-linefetch" during boot of my Odroid HC1 board. >>>> >>>> At least bisect points to it. >>>> >>>> If fixed, please add: >>>> >>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>>> >>>> [ 4.861683] 8<--- cut here --- >>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c >>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>>> ... >>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c >>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 >>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 >>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 >>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>>> >>>> Full log: >>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>>> >>> >>> Hey Krzysztof, >>> >>> This is interesting. It might be about the cores that are missing from the partial >>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>> see here. >> >> I've seen such external aborts in the past, and the fault type has >> often been misleading. It's unlikely to have anything to do with a > > Yeah, often accessing device with power or clocks gated. > Except my commit does *not* gate SoC power, nor SoC clocks 🙂 What the "Really power off ..." commit does is to ask the GPU to internally power off the shaders, tilers and L2, that's why I say that it is strange to see that kind of abort. The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} registers should still be accessible even with shaders, tilers and cache OFF. Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also work, but that'd add up quite a bit of latency on the runtime_suspend() call, so in this case I'd be more for avoiding to execute any register r/w in the handler by either checking if the GPU is supposed to be OFF, or clearing interrupts, which may not work if those are generated after the execution of the poweroff function. Or we could simply disable the irq after power_off, but that'd be hacky (as well). Let's see if asking to poweroff *everything* works: --- drivers/gpu/drm/panfrost/panfrost_gpu.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c index 09f5e1563ebd..1c7276aaa182 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -429,21 +429,29 @@ void panfrost_gpu_power_off(struct panfrost_device *pfdev) int ret; u32 val; - gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask); + gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); + gpu_write(pfdev, SHADER_PWROFF_HI, U32_MAX); ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, val, !val, 1, 1000); if (ret) dev_err(pfdev->dev, "shader power transition timeout"); gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); + gpu_write(pfdev, TILER_PWROFF_HI, U32_MAX); ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, val, !val, 1, 1000); if (ret) dev_err(pfdev->dev, "tiler power transition timeout"); - gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask); + gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, - val, !val, 0, 1000); + val, !val, 0, 1000); + if (ret) + dev_err(pfdev->dev, "l2_low power transition timeout"); + + gpu_write(pfdev, L2_PWROFF_HI, U32_MAX); + ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, + val, !val, 0, 1000); if (ret) dev_err(pfdev->dev, "l2 power transition timeout"); }
On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote: >>>>> >>>> >>>> Hey Krzysztof, >>>> >>>> This is interesting. It might be about the cores that are missing from the partial >>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>>> see here. >>> >>> I've seen such external aborts in the past, and the fault type has >>> often been misleading. It's unlikely to have anything to do with a >> >> Yeah, often accessing device with power or clocks gated. >> > > Except my commit does *not* gate SoC power, nor SoC clocks 🙂 It could be that something (like clocks or power supplies) was missing on this board/SoC, which was not critical till your patch came. > > What the "Really power off ..." commit does is to ask the GPU to internally power > off the shaders, tilers and L2, that's why I say that it is strange to see that > kind of abort. > > The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} > registers should still be accessible even with shaders, tilers and cache OFF. > > Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also > work, but that'd add up quite a bit of latency on the runtime_suspend() call, so > in this case I'd be more for avoiding to execute any register r/w in the handler > by either checking if the GPU is supposed to be OFF, or clearing interrupts, which > may not work if those are generated after the execution of the poweroff function. > Or we could simply disable the irq after power_off, but that'd be hacky (as well). > > > Let's see if asking to poweroff *everything* works: Worked. Best regards, Krzysztof
On 22/11/2023 09:06, AngeloGioacchino Del Regno wrote: > Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: >> On 21/11/2023 17:55, Boris Brezillon wrote: >>> On Tue, 21 Nov 2023 17:11:42 +0100 >>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>> wrote: >>> >>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>>>> On 08/11/2023 14:20, Steven Price wrote: >>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to >>>>>>> request >>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO >>>>>>> ones: >>>>>>> this means that in order to request poweroff of cores, we are >>>>>>> supposed >>>>>>> to write a bitmask of cores that should be powered off! >>>>>>> This means that the panfrost_gpu_power_off() function has always >>>>>>> been >>>>>>> doing nothing. >>>>>>> >>>>>>> Fix powering off the GPU by writing a bitmask of the cores to >>>>>>> poweroff >>>>>>> to the relevant PWROFF_LO registers and then check that the >>>>>>> transition >>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>>>> registers. >>>>>>> >>>>>>> While at it, in order to avoid code duplication, move the core mask >>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>>>> function, used in both poweron and poweroff. >>>>>>> >>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>>>> Signed-off-by: AngeloGioacchino Del Regno >>>>>>> <angelogioacchino.delregno@collabora.com> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> This commit was added to next recently but it causes "external >>>>> abort on >>>>> non-linefetch" during boot of my Odroid HC1 board. >>>>> >>>>> At least bisect points to it. >>>>> >>>>> If fixed, please add: >>>>> >>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>>>> >>>>> [ 4.861683] 8<--- cut here --- >>>>> [ 4.863429] Unhandled fault: external abort on non-linefetch >>>>> (0x1008) at 0xf0c8802c >>>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>>>> ... >>>>> [ 5.164010] panfrost_gpu_irq_handler from >>>>> __handle_irq_event_percpu+0xcc/0x31c >>>>> [ 5.171276] __handle_irq_event_percpu from >>>>> handle_irq_event+0x38/0x80 >>>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>>>> [ 5.183743] handle_fasteoi_irq from >>>>> generic_handle_domain_irq+0x28/0x38 >>>>> [ 5.190417] generic_handle_domain_irq from >>>>> gic_handle_irq+0x88/0xa8 >>>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>>>> >>>>> Full log: >>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>>>> >>>> >>>> Hey Krzysztof, >>>> >>>> This is interesting. It might be about the cores that are missing >>>> from the partial >>>> core_mask raising interrupts, but an external abort on non-linefetch >>>> is strange to >>>> see here. >>> >>> I've seen such external aborts in the past, and the fault type has >>> often been misleading. It's unlikely to have anything to do with a >> >> Yeah, often accessing device with power or clocks gated. >> > > Except my commit does *not* gate SoC power, nor SoC clocks 🙂 > > What the "Really power off ..." commit does is to ask the GPU to > internally power > off the shaders, tilers and L2, that's why I say that it is strange to > see that > kind of abort. > > The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and > GPU_FAULT_ADDRESS_{HI/LO} > registers should still be accessible even with shaders, tilers and cache > OFF. > > Anyway, yes, synchronizing IRQs before calling the poweroff sequence > would also > work, but that'd add up quite a bit of latency on the runtime_suspend() > call, so > in this case I'd be more for avoiding to execute any register r/w in the > handler > by either checking if the GPU is supposed to be OFF, or clearing > interrupts, which > may not work if those are generated after the execution of the poweroff > function. > Or we could simply disable the irq after power_off, but that'd be hacky > (as well). > > > Let's see if asking to poweroff *everything* works: > > > --- > drivers/gpu/drm/panfrost/panfrost_gpu.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c > b/drivers/gpu/drm/panfrost/panfrost_gpu.c > index 09f5e1563ebd..1c7276aaa182 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c > +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c > @@ -429,21 +429,29 @@ void panfrost_gpu_power_off(struct panfrost_device > *pfdev) > int ret; > u32 val; > > - gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & > core_mask); > + gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); Hopefully this one line change, and... > + gpu_write(pfdev, SHADER_PWROFF_HI, U32_MAX); > ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "shader power transition timeout"); > > gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); > + gpu_write(pfdev, TILER_PWROFF_HI, U32_MAX); > ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "tiler power transition timeout"); > > - gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & > core_mask); > + gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); ... this one are all that are actually needed - the rest should be ignored as they affect cores that aren't present. The Exynos 5422 SoC has a T628 MP6 - so two core groups which isn't a particularly well supported configuration. But I'm not sure how we're ending up with the second core group being powered up in the first place. Even if it was left powered by something previous (e.g. the bootloader) then the soft-reset during probe should cause them to power down. But it seems like a good idea to power off everything when powering down, even if we didn't expect the cores to be on. Boris also has a point that before cutting the power/clocks we should really be synchronising with the IRQs - but that affects the follow on patches not this one. Steve > ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, > - val, !val, 0, 1000); > + val, !val, 0, 1000); > + if (ret) > + dev_err(pfdev->dev, "l2_low power transition timeout"); > + > + gpu_write(pfdev, L2_PWROFF_HI, U32_MAX); > + ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, > + val, !val, 0, 1000); > if (ret) > dev_err(pfdev->dev, "l2 power transition timeout"); > }
Hi Angelo, On Wed, 22 Nov 2023 10:06:19 +0100 AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> wrote: > Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: > > On 21/11/2023 17:55, Boris Brezillon wrote: > >> On Tue, 21 Nov 2023 17:11:42 +0100 > >> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > >> wrote: > >> > >>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: > >>>> On 08/11/2023 14:20, Steven Price wrote: > >>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: > >>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request > >>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: > >>>>>> this means that in order to request poweroff of cores, we are supposed > >>>>>> to write a bitmask of cores that should be powered off! > >>>>>> This means that the panfrost_gpu_power_off() function has always been > >>>>>> doing nothing. > >>>>>> > >>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff > >>>>>> to the relevant PWROFF_LO registers and then check that the transition > >>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO > >>>>>> registers. > >>>>>> > >>>>>> While at it, in order to avoid code duplication, move the core mask > >>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() > >>>>>> function, used in both poweron and poweroff. > >>>>>> > >>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") > >>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > >>>> > >>>> > >>>> Hi, > >>>> > >>>> This commit was added to next recently but it causes "external abort on > >>>> non-linefetch" during boot of my Odroid HC1 board. > >>>> > >>>> At least bisect points to it. > >>>> > >>>> If fixed, please add: > >>>> > >>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > >>>> > >>>> [ 4.861683] 8<--- cut here --- > >>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c > >>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 > >>>> ... > >>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c > >>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 > >>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 > >>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 > >>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 > >>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 > >>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 > >>>> > >>>> Full log: > >>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 > >>>> > >>> > >>> Hey Krzysztof, > >>> > >>> This is interesting. It might be about the cores that are missing from the partial > >>> core_mask raising interrupts, but an external abort on non-linefetch is strange to > >>> see here. > >> > >> I've seen such external aborts in the past, and the fault type has > >> often been misleading. It's unlikely to have anything to do with a > > > > Yeah, often accessing device with power or clocks gated. > > > > Except my commit does *not* gate SoC power, nor SoC clocks 🙂 It's not directly related to your commit, it's just a side effect. > > What the "Really power off ..." commit does is to ask the GPU to internally power > off the shaders, tilers and L2, that's why I say that it is strange to see that > kind of abort. > > The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} > registers should still be accessible even with shaders, tilers and cache OFF. It's not the power_off() call that's problematic, it's when it happens (the last thing called in panfrost_device_runtime_suspend()), and the fact it generates interrupts. Yes, you don't explicitly gate the clocks in panfrost_device_runtime_suspend(), but the PM layer does interact directly with power domains, and shutting down a power domain might result in other clks/components being gated, which might make the register bank inaccessible from the CPU. > > Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also > work, but that'd add up quite a bit of latency on the runtime_suspend() call, Really? In practice I'd expect no pending interrupts, other than the power transition ones, which are purely and simply ignored by the handler. If we had any other pending interrupts on suspend, we would have faced this problem before. To sum-up, I'd expect the extra latency to just be the overhead of the synchronize_irq() call, which, after looking at the code, shouldn't be such a big deal. > so > in this case I'd be more for avoiding to execute any register r/w in the handler > by either checking if the GPU is supposed to be OFF, Yes, that's an option, but I don't think that's enough (see below). > or clearing interrupts, The handler might have been called already when you clear the interrupt, and you'd still need to make sure the handler has returned before returning from panfrost_device_runtime_suspend() if you want to guarantee no one is touching the registers when the power domains are shutdown. > which > may not work if those are generated after the execution of the poweroff function. They are generated while you poll the register, but that doesn't guarantee they will be processed by the time you return from your power_off() function, which I think is exactly the problem we're facing here. > Or we could simply disable the irq after power_off, but that'd be hacky (as well). If by disabling the interrupt you mean calling disable_irq(), that would work if the irq lines were not declared as shared (IRQF_SHARED flag passed at request time). Beside, the latency of disable_irq() should be pretty much the same as synchronize_irq(), given synchronize_irq() from there. If by disabling the interrupt, you mean masking it with _INT_MASK, then, as said above, that's not enough. You need to make sure any handler that may have been called as a result of this interrupt, returns before you return from the suspend function, so you need some kind of synchronization. > > > Let's see if asking to poweroff *everything* works: It might slightly change the timing, making this problem disappear by chance (if the interrupt gets processed before power_off() returns), but it doesn't make the suspend logic more robust. We really have to guarantee that no one will touch the registers when we enter suspend, be it some interrupt handler, or any kind of deferred work. Again, none of this is a direct result of your patch, it's just that your patch uncovered the problem, and I think now is a good time to fix it properly. > > > --- > drivers/gpu/drm/panfrost/panfrost_gpu.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c > b/drivers/gpu/drm/panfrost/panfrost_gpu.c > index 09f5e1563ebd..1c7276aaa182 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c > +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c > @@ -429,21 +429,29 @@ void panfrost_gpu_power_off(struct panfrost_device *pfdev) > int ret; > u32 val; > > - gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask); > + gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); > + gpu_write(pfdev, SHADER_PWROFF_HI, U32_MAX); > ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "shader power transition timeout"); > > gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); > + gpu_write(pfdev, TILER_PWROFF_HI, U32_MAX); > ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, > val, !val, 1, 1000); > if (ret) > dev_err(pfdev->dev, "tiler power transition timeout"); > > - gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask); > + gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); > ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, > - val, !val, 0, 1000); > + val, !val, 0, 1000); > + if (ret) > + dev_err(pfdev->dev, "l2_low power transition timeout"); > + > + gpu_write(pfdev, L2_PWROFF_HI, U32_MAX); > + ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, > + val, !val, 0, 1000); > if (ret) > dev_err(pfdev->dev, "l2 power transition timeout"); > }
Il 22/11/23 10:54, Boris Brezillon ha scritto: > Hi Angelo, > > On Wed, 22 Nov 2023 10:06:19 +0100 > AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > wrote: > >> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: >>> On 21/11/2023 17:55, Boris Brezillon wrote: >>>> On Tue, 21 Nov 2023 17:11:42 +0100 >>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>>> wrote: >>>> >>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>>>>> On 08/11/2023 14:20, Steven Price wrote: >>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request >>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: >>>>>>>> this means that in order to request poweroff of cores, we are supposed >>>>>>>> to write a bitmask of cores that should be powered off! >>>>>>>> This means that the panfrost_gpu_power_off() function has always been >>>>>>>> doing nothing. >>>>>>>> >>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff >>>>>>>> to the relevant PWROFF_LO registers and then check that the transition >>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>>>>> registers. >>>>>>>> >>>>>>>> While at it, in order to avoid code duplication, move the core mask >>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>>>>> function, used in both poweron and poweroff. >>>>>>>> >>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> This commit was added to next recently but it causes "external abort on >>>>>> non-linefetch" during boot of my Odroid HC1 board. >>>>>> >>>>>> At least bisect points to it. >>>>>> >>>>>> If fixed, please add: >>>>>> >>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>>>>> >>>>>> [ 4.861683] 8<--- cut here --- >>>>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c >>>>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>>>>> ... >>>>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c >>>>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 >>>>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>>>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 >>>>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 >>>>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>>>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>>>>> >>>>>> Full log: >>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>>>>> >>>>> >>>>> Hey Krzysztof, >>>>> >>>>> This is interesting. It might be about the cores that are missing from the partial >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>>>> see here. >>>> >>>> I've seen such external aborts in the past, and the fault type has >>>> often been misleading. It's unlikely to have anything to do with a >>> >>> Yeah, often accessing device with power or clocks gated. >>> >> >> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 > > It's not directly related to your commit, it's just a side effect. > Indeed! >> >> What the "Really power off ..." commit does is to ask the GPU to internally power >> off the shaders, tilers and L2, that's why I say that it is strange to see that >> kind of abort. >> >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} >> registers should still be accessible even with shaders, tilers and cache OFF. > > It's not the power_off() call that's problematic, it's when it happens > (the last thing called in panfrost_device_runtime_suspend()), and the > fact it generates interrupts. Yes, you don't explicitly gate the clocks > in panfrost_device_runtime_suspend(), but the PM layer does interact > directly with power domains, and shutting down a power domain might > result in other clks/components being gated, which might make the > register bank inaccessible from the CPU. > >> >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also >> work, but that'd add up quite a bit of latency on the runtime_suspend() call, > > Really? In practice I'd expect no pending interrupts, other than the > power transition ones, which are purely and simply ignored by the > handler. If we had any other pending interrupts on suspend, we would > have faced this problem before. To sum-up, I'd expect the extra latency > to just be the overhead of the synchronize_irq() call, which, after > looking at the code, shouldn't be such a big deal. > >> so >> in this case I'd be more for avoiding to execute any register r/w in the handler >> by either checking if the GPU is supposed to be OFF, > > Yes, that's an option, but I don't think that's enough (see below). > >> or clearing interrupts, > > The handler might have been called already when you clear the > interrupt, and you'd still need to make sure the handler has returned > before returning from panfrost_device_runtime_suspend() if you want to > guarantee no one is touching the registers when the power domains are > shutdown. > >> which >> may not work if those are generated after the execution of the poweroff function. > > They are generated while you poll the register, but that doesn't > guarantee they will be processed by the time you return from your > power_off() function, which I think is exactly the problem we're facing > here. > >> Or we could simply disable the irq after power_off, but that'd be hacky (as well). > > If by disabling the interrupt you mean calling disable_irq(), that > would work if the irq lines were not declared as shared (IRQF_SHARED > flag passed at request time). Beside, the latency of disable_irq() > should be pretty much the same as synchronize_irq(), given > synchronize_irq() from there. > > If by disabling the interrupt, you mean masking it with _INT_MASK, > then, as said above, that's not enough. You need to make sure any > handler that may have been called as a result of this interrupt, > returns before you return from the suspend function, so you need some > kind of synchronization. > Your reasons are totally valid and I see the point. That's what I'll do as a follow-up Fixes patch: - gpu_write(pfdev, GPU_INT_MASK, 0); - gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL); - synchronize_irq() - poweroff *all* shaders/tilers/l2 (without caring about core_mask) - *No* INT_MASK restore, as we rely on soft_reset() to do that for us once we resume the GPU. >> >> >> Let's see if asking to poweroff *everything* works: > > It might slightly change the timing, making this problem disappear by > chance (if the interrupt gets processed before power_off() returns), > but it doesn't make the suspend logic more robust. We really have to > guarantee that no one will touch the registers when we enter suspend, > be it some interrupt handler, or any kind of deferred work. > > Again, none of this is a direct result of your patch, it's just that > your patch uncovered the problem, and I think now is a good time to fix > it properly. > Yes, I am well aware of that and I was trying to make that clear in the first place - I'm sorry if I gave the impression of having any kind of doubt around that, or any other. Cheers! Angelo
Il 22/11/23 10:48, Steven Price ha scritto: > On 22/11/2023 09:06, AngeloGioacchino Del Regno wrote: >> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: >>> On 21/11/2023 17:55, Boris Brezillon wrote: >>>> On Tue, 21 Nov 2023 17:11:42 +0100 >>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> >>>> wrote: >>>> >>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: >>>>>> On 08/11/2023 14:20, Steven Price wrote: >>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: >>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to >>>>>>>> request >>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO >>>>>>>> ones: >>>>>>>> this means that in order to request poweroff of cores, we are >>>>>>>> supposed >>>>>>>> to write a bitmask of cores that should be powered off! >>>>>>>> This means that the panfrost_gpu_power_off() function has always >>>>>>>> been >>>>>>>> doing nothing. >>>>>>>> >>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to >>>>>>>> poweroff >>>>>>>> to the relevant PWROFF_LO registers and then check that the >>>>>>>> transition >>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO >>>>>>>> registers. >>>>>>>> >>>>>>>> While at it, in order to avoid code duplication, move the core mask >>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() >>>>>>>> function, used in both poweron and poweroff. >>>>>>>> >>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") >>>>>>>> Signed-off-by: AngeloGioacchino Del Regno >>>>>>>> <angelogioacchino.delregno@collabora.com> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> This commit was added to next recently but it causes "external >>>>>> abort on >>>>>> non-linefetch" during boot of my Odroid HC1 board. >>>>>> >>>>>> At least bisect points to it. >>>>>> >>>>>> If fixed, please add: >>>>>> >>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> >>>>>> >>>>>> [ 4.861683] 8<--- cut here --- >>>>>> [ 4.863429] Unhandled fault: external abort on non-linefetch >>>>>> (0x1008) at 0xf0c8802c >>>>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 >>>>>> ... >>>>>> [ 5.164010] panfrost_gpu_irq_handler from >>>>>> __handle_irq_event_percpu+0xcc/0x31c >>>>>> [ 5.171276] __handle_irq_event_percpu from >>>>>> handle_irq_event+0x38/0x80 >>>>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 >>>>>> [ 5.183743] handle_fasteoi_irq from >>>>>> generic_handle_domain_irq+0x28/0x38 >>>>>> [ 5.190417] generic_handle_domain_irq from >>>>>> gic_handle_irq+0x88/0xa8 >>>>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 >>>>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 >>>>>> >>>>>> Full log: >>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 >>>>>> >>>>> >>>>> Hey Krzysztof, >>>>> >>>>> This is interesting. It might be about the cores that are missing >>>>> from the partial >>>>> core_mask raising interrupts, but an external abort on non-linefetch >>>>> is strange to >>>>> see here. >>>> >>>> I've seen such external aborts in the past, and the fault type has >>>> often been misleading. It's unlikely to have anything to do with a >>> >>> Yeah, often accessing device with power or clocks gated. >>> >> >> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 >> >> What the "Really power off ..." commit does is to ask the GPU to >> internally power >> off the shaders, tilers and L2, that's why I say that it is strange to >> see that >> kind of abort. >> >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and >> GPU_FAULT_ADDRESS_{HI/LO} >> registers should still be accessible even with shaders, tilers and cache >> OFF. >> >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence >> would also >> work, but that'd add up quite a bit of latency on the runtime_suspend() >> call, so >> in this case I'd be more for avoiding to execute any register r/w in the >> handler >> by either checking if the GPU is supposed to be OFF, or clearing >> interrupts, which >> may not work if those are generated after the execution of the poweroff >> function. >> Or we could simply disable the irq after power_off, but that'd be hacky >> (as well). >> >> >> Let's see if asking to poweroff *everything* works: >> >> >> --- >> drivers/gpu/drm/panfrost/panfrost_gpu.c | 14 +++++++++++--- >> 1 file changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c >> b/drivers/gpu/drm/panfrost/panfrost_gpu.c >> index 09f5e1563ebd..1c7276aaa182 100644 >> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c >> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c >> @@ -429,21 +429,29 @@ void panfrost_gpu_power_off(struct panfrost_device >> *pfdev) >> int ret; >> u32 val; >> >> - gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & >> core_mask); >> + gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present); > > Hopefully this one line change, and... > >> + gpu_write(pfdev, SHADER_PWROFF_HI, U32_MAX); >> ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, >> val, !val, 1, 1000); >> if (ret) >> dev_err(pfdev->dev, "shader power transition timeout"); >> >> gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); >> + gpu_write(pfdev, TILER_PWROFF_HI, U32_MAX); >> ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, >> val, !val, 1, 1000); >> if (ret) >> dev_err(pfdev->dev, "tiler power transition timeout"); >> >> - gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & >> core_mask); >> + gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present); > > ... this one are all that are actually needed - the rest should be > ignored as they affect cores that aren't present. > Honestly - when I wrote that diff, I didn't care at all whether the HI registers were powering off cores that weren't present, because I knew that the GPU would have handled that gracefully anyway. What I wanted to do was to reduce Krzysztof's testing effort to a minimum, actually preventing to send more than one patch to try... but with that, you bought me a bit of precious time that I would've spent with research, so, hats off! Thank you! > The Exynos 5422 SoC has a T628 MP6 - so two core groups which isn't a > particularly well supported configuration. But I'm not sure how we're > ending up with the second core group being powered up in the first > place. Even if it was left powered by something previous (e.g. the > bootloader) then the soft-reset during probe should cause them to power > down. > Hm. I didn't know that soft_reset is supposed to (and will) power down cores. This is clarifying some things I didn't really have an explanation for... so thanks again :-) > But it seems like a good idea to power off everything when powering > down, even if we didn't expect the cores to be on. > > Boris also has a point that before cutting the power/clocks we should > really be synchronising with the IRQs - but that affects the follow on > patches not this one. > ...which gives me some more ideas to try... in the near future. But it's out of context for this fix anyway. Cheers, Angelo > Steve > >> ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, >> - val, !val, 0, 1000); >> + val, !val, 0, 1000); >> + if (ret) >> + dev_err(pfdev->dev, "l2_low power transition timeout"); >> + >> + gpu_write(pfdev, L2_PWROFF_HI, U32_MAX); >> + ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI, >> + val, !val, 0, 1000); >> if (ret) >> dev_err(pfdev->dev, "l2 power transition timeout"); >> } >
On Wed, 22 Nov 2023 11:23:05 +0100 AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> wrote: > Il 22/11/23 10:54, Boris Brezillon ha scritto: > > Hi Angelo, > > > > On Wed, 22 Nov 2023 10:06:19 +0100 > > AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > > wrote: > > > >> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto: > >>> On 21/11/2023 17:55, Boris Brezillon wrote: > >>>> On Tue, 21 Nov 2023 17:11:42 +0100 > >>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > >>>> wrote: > >>>> > >>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto: > >>>>>> On 08/11/2023 14:20, Steven Price wrote: > >>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote: > >>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request > >>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones: > >>>>>>>> this means that in order to request poweroff of cores, we are supposed > >>>>>>>> to write a bitmask of cores that should be powered off! > >>>>>>>> This means that the panfrost_gpu_power_off() function has always been > >>>>>>>> doing nothing. > >>>>>>>> > >>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff > >>>>>>>> to the relevant PWROFF_LO registers and then check that the transition > >>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO > >>>>>>>> registers. > >>>>>>>> > >>>>>>>> While at it, in order to avoid code duplication, move the core mask > >>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask() > >>>>>>>> function, used in both poweron and poweroff. > >>>>>>>> > >>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") > >>>>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> > >>>>>> > >>>>>> > >>>>>> Hi, > >>>>>> > >>>>>> This commit was added to next recently but it causes "external abort on > >>>>>> non-linefetch" during boot of my Odroid HC1 board. > >>>>>> > >>>>>> At least bisect points to it. > >>>>>> > >>>>>> If fixed, please add: > >>>>>> > >>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > >>>>>> > >>>>>> [ 4.861683] 8<--- cut here --- > >>>>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c > >>>>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453 > >>>>>> ... > >>>>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c > >>>>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80 > >>>>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250 > >>>>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38 > >>>>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 > >>>>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44 > >>>>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0 > >>>>>> > >>>>>> Full log: > >>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0 > >>>>>> > >>>>> > >>>>> Hey Krzysztof, > >>>>> > >>>>> This is interesting. It might be about the cores that are missing from the partial > >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to > >>>>> see here. > >>>> > >>>> I've seen such external aborts in the past, and the fault type has > >>>> often been misleading. It's unlikely to have anything to do with a > >>> > >>> Yeah, often accessing device with power or clocks gated. > >>> > >> > >> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 > > > > It's not directly related to your commit, it's just a side effect. > > > > Indeed! > > >> > >> What the "Really power off ..." commit does is to ask the GPU to internally power > >> off the shaders, tilers and L2, that's why I say that it is strange to see that > >> kind of abort. > >> > >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} > >> registers should still be accessible even with shaders, tilers and cache OFF. > > > > It's not the power_off() call that's problematic, it's when it happens > > (the last thing called in panfrost_device_runtime_suspend()), and the > > fact it generates interrupts. Yes, you don't explicitly gate the clocks > > in panfrost_device_runtime_suspend(), but the PM layer does interact > > directly with power domains, and shutting down a power domain might > > result in other clks/components being gated, which might make the > > register bank inaccessible from the CPU. > > > >> > >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also > >> work, but that'd add up quite a bit of latency on the runtime_suspend() call, > > > > Really? In practice I'd expect no pending interrupts, other than the > > power transition ones, which are purely and simply ignored by the > > handler. If we had any other pending interrupts on suspend, we would > > have faced this problem before. To sum-up, I'd expect the extra latency > > to just be the overhead of the synchronize_irq() call, which, after > > looking at the code, shouldn't be such a big deal. > > > >> so > >> in this case I'd be more for avoiding to execute any register r/w in the handler > >> by either checking if the GPU is supposed to be OFF, > > > > Yes, that's an option, but I don't think that's enough (see below). > > > >> or clearing interrupts, > > > > The handler might have been called already when you clear the > > interrupt, and you'd still need to make sure the handler has returned > > before returning from panfrost_device_runtime_suspend() if you want to > > guarantee no one is touching the registers when the power domains are > > shutdown. > > > >> which > >> may not work if those are generated after the execution of the poweroff function. > > > > They are generated while you poll the register, but that doesn't > > guarantee they will be processed by the time you return from your > > power_off() function, which I think is exactly the problem we're facing > > here. > > > >> Or we could simply disable the irq after power_off, but that'd be hacky (as well). > > > > If by disabling the interrupt you mean calling disable_irq(), that > > would work if the irq lines were not declared as shared (IRQF_SHARED > > flag passed at request time). Beside, the latency of disable_irq() > > should be pretty much the same as synchronize_irq(), given > > synchronize_irq() from there. > > > > If by disabling the interrupt, you mean masking it with _INT_MASK, > > then, as said above, that's not enough. You need to make sure any > > handler that may have been called as a result of this interrupt, > > returns before you return from the suspend function, so you need some > > kind of synchronization. > > > > Your reasons are totally valid and I see the point. > > That's what I'll do as a follow-up Fixes patch: > - gpu_write(pfdev, GPU_INT_MASK, 0); > - gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL); > - synchronize_irq() More generally, I think we should have helpers that do that for the 3 irqs we in panfrost (gpu, mmu and job), because ultimately, the problem exists for all of them. > - poweroff *all* shaders/tilers/l2 (without caring about core_mask) Sounds good to me. > - *No* INT_MASK restore, as we rely on soft_reset() to do that for us > once we resume the GPU. Yeah, I didn't check, but if soft_reset() restores all the _INT_MASK properly, and it's called in the resume path, we're good. > > > >> > >> > >> Let's see if asking to poweroff *everything* works: > > > > It might slightly change the timing, making this problem disappear by > > chance (if the interrupt gets processed before power_off() returns), > > but it doesn't make the suspend logic more robust. We really have to > > guarantee that no one will touch the registers when we enter suspend, > > be it some interrupt handler, or any kind of deferred work. > > > > Again, none of this is a direct result of your patch, it's just that > > your patch uncovered the problem, and I think now is a good time to fix > > it properly. > > > > Yes, I am well aware of that and I was trying to make that clear in the first > place - I'm sorry if I gave the impression of having any kind of doubt around > that, or any other. Not particularly, just wanted to insist on the fact there is no blame to be taken for this regression, and that's actually a good opportunity to fix the PM logic with regards to interrupt handling. I'm glad you're now volunteering for that :-).
On 22.11.2023 10:29, Krzysztof Kozlowski wrote: > On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote: >>>>> Hey Krzysztof, >>>>> >>>>> This is interesting. It might be about the cores that are missing from the partial >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>>>> see here. >>>> I've seen such external aborts in the past, and the fault type has >>>> often been misleading. It's unlikely to have anything to do with a >>> Yeah, often accessing device with power or clocks gated. >>> >> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 > It could be that something (like clocks or power supplies) was missing > on this board/SoC, which was not critical till your patch came. > >> What the "Really power off ..." commit does is to ask the GPU to internally power >> off the shaders, tilers and L2, that's why I say that it is strange to see that >> kind of abort. >> >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} >> registers should still be accessible even with shaders, tilers and cache OFF. >> >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also >> work, but that'd add up quite a bit of latency on the runtime_suspend() call, so >> in this case I'd be more for avoiding to execute any register r/w in the handler >> by either checking if the GPU is supposed to be OFF, or clearing interrupts, which >> may not work if those are generated after the execution of the poweroff function. >> Or we could simply disable the irq after power_off, but that'd be hacky (as well). >> >> >> Let's see if asking to poweroff *everything* works: > Worked. Yes, I also got into this issue some time ago, but I didn't report it because I also had some power supply related problems on my test farm and everything was a bit unstable. I wasn't 100% sure that the $subject patch is responsible for the observed issues. Now, after fixing power supply, I confirm that the issue was revealed by the $subject patch and above mentioned change fixes the problem. Feel free to add: Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Best regards
On 24.11.2023 13:45, Marek Szyprowski wrote: > On 22.11.2023 10:29, Krzysztof Kozlowski wrote: >> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote: >>>>>> Hey Krzysztof, >>>>>> >>>>>> This is interesting. It might be about the cores that are missing >>>>>> from the partial >>>>>> core_mask raising interrupts, but an external abort on >>>>>> non-linefetch is strange to >>>>>> see here. >>>>> I've seen such external aborts in the past, and the fault type has >>>>> often been misleading. It's unlikely to have anything to do with a >>>> Yeah, often accessing device with power or clocks gated. >>>> >>> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 >> It could be that something (like clocks or power supplies) was missing >> on this board/SoC, which was not critical till your patch came. >> >>> What the "Really power off ..." commit does is to ask the GPU to >>> internally power >>> off the shaders, tilers and L2, that's why I say that it is strange >>> to see that >>> kind of abort. >>> >>> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and >>> GPU_FAULT_ADDRESS_{HI/LO} >>> registers should still be accessible even with shaders, tilers and >>> cache OFF. >>> >>> Anyway, yes, synchronizing IRQs before calling the poweroff sequence >>> would also >>> work, but that'd add up quite a bit of latency on the >>> runtime_suspend() call, so >>> in this case I'd be more for avoiding to execute any register r/w in >>> the handler >>> by either checking if the GPU is supposed to be OFF, or clearing >>> interrupts, which >>> may not work if those are generated after the execution of the >>> poweroff function. >>> Or we could simply disable the irq after power_off, but that'd be >>> hacky (as well). >>> >>> >>> Let's see if asking to poweroff *everything* works: >> Worked. > > Yes, I also got into this issue some time ago, but I didn't report it > because I also had some power supply related problems on my test farm > and everything was a bit unstable. I wasn't 100% sure that the > $subject patch is responsible for the observed issues. Now, after > fixing power supply, I confirm that the issue was revealed by the > $subject patch and above mentioned change fixes the problem. Feel free > to add: > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> I must revoke my tested-by tag for the above fix alone. Although it fixed the boot issue and system stability issue, it looks that there is still something missing and opening the panfrost dri device causes a system crash: root@target:~# ./modetest -C trying to open device 'i915'...failed trying to open device 'amdgpu'...failed trying to open device 'radeon'...failed trying to open device 'nouveau'...failed trying to open device 'vmwgfx'...failed trying to open device 'omapdrm'...failed trying to open device 'exynos'...done root@target:~# 8<--- cut here --- Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c [f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453 Internal error: : 1008 [#1] PREEMPT SMP ARM Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc s5p_cec CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649 Hardware name: Samsung Exynos (Flattened Device Tree) PC is at panfrost_gpu_irq_handler+0x18/0xfc LR is at __handle_irq_event_percpu+0xcc/0x31c ... Process swapper/0 (pid: 0, stack limit = 0x0e2875ff) Stack: (0xc1301e48 to 0xc1302000) ... panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c __handle_irq_event_percpu from handle_irq_event+0x38/0x80 handle_irq_event from handle_fasteoi_irq+0x9c/0x250 handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34 generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 gic_handle_irq from generic_handle_arch_irq+0x34/0x44 generic_handle_arch_irq from __irq_svc+0x8c/0xd0 Exception stack(0xc1301f10 to 0xc1301f58) ... __irq_svc from default_idle_call+0x20/0x2c4 default_idle_call from do_idle+0x244/0x2b4 do_idle from cpu_startup_entry+0x28/0x2c cpu_startup_entry from rest_init+0xec/0x190 rest_init from arch_post_acpi_subsys_init+0x0/0x8 Code: e591300c e593402c f57ff04f e591300c (e593903c) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception in interrupt CPU2: stopping It looks that the panfrost interrupts must be somehow synchronized with turning power off, what has been already discussed. Let me know if you want me to test any patch. Best regards
Il 27/11/23 12:24, Marek Szyprowski ha scritto: > On 24.11.2023 13:45, Marek Szyprowski wrote: >> On 22.11.2023 10:29, Krzysztof Kozlowski wrote: >>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote: >>>>>>> Hey Krzysztof, >>>>>>> >>>>>>> This is interesting. It might be about the cores that are missing >>>>>>> from the partial >>>>>>> core_mask raising interrupts, but an external abort on >>>>>>> non-linefetch is strange to >>>>>>> see here. >>>>>> I've seen such external aborts in the past, and the fault type has >>>>>> often been misleading. It's unlikely to have anything to do with a >>>>> Yeah, often accessing device with power or clocks gated. >>>>> >>>> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 >>> It could be that something (like clocks or power supplies) was missing >>> on this board/SoC, which was not critical till your patch came. >>> >>>> What the "Really power off ..." commit does is to ask the GPU to >>>> internally power >>>> off the shaders, tilers and L2, that's why I say that it is strange >>>> to see that >>>> kind of abort. >>>> >>>> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and >>>> GPU_FAULT_ADDRESS_{HI/LO} >>>> registers should still be accessible even with shaders, tilers and >>>> cache OFF. >>>> >>>> Anyway, yes, synchronizing IRQs before calling the poweroff sequence >>>> would also >>>> work, but that'd add up quite a bit of latency on the >>>> runtime_suspend() call, so >>>> in this case I'd be more for avoiding to execute any register r/w in >>>> the handler >>>> by either checking if the GPU is supposed to be OFF, or clearing >>>> interrupts, which >>>> may not work if those are generated after the execution of the >>>> poweroff function. >>>> Or we could simply disable the irq after power_off, but that'd be >>>> hacky (as well). >>>> >>>> >>>> Let's see if asking to poweroff *everything* works: >>> Worked. >> >> Yes, I also got into this issue some time ago, but I didn't report it >> because I also had some power supply related problems on my test farm >> and everything was a bit unstable. I wasn't 100% sure that the >> $subject patch is responsible for the observed issues. Now, after >> fixing power supply, I confirm that the issue was revealed by the >> $subject patch and above mentioned change fixes the problem. Feel free >> to add: >> >> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > > > I must revoke my tested-by tag for the above fix alone. Although it > fixed the boot issue and system stability issue, it looks that there is > still something missing and opening the panfrost dri device causes a > system crash: > > root@target:~# ./modetest -C > trying to open device 'i915'...failed > trying to open device 'amdgpu'...failed > trying to open device 'radeon'...failed > trying to open device 'nouveau'...failed > trying to open device 'vmwgfx'...failed > trying to open device 'omapdrm'...failed > trying to open device 'exynos'...done > root@target:~# > > 8<--- cut here --- > Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c > [f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453 > Internal error: : 1008 [#1] PREEMPT SMP ARM > Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem > videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common > videodev mc s5p_cec > CPU: 0 PID: 0 Comm: swapper/0 Not tainted > 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649 > Hardware name: Samsung Exynos (Flattened Device Tree) > PC is at panfrost_gpu_irq_handler+0x18/0xfc > LR is at __handle_irq_event_percpu+0xcc/0x31c > ... > Process swapper/0 (pid: 0, stack limit = 0x0e2875ff) > Stack: (0xc1301e48 to 0xc1302000) > ... > panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c > __handle_irq_event_percpu from handle_irq_event+0x38/0x80 > handle_irq_event from handle_fasteoi_irq+0x9c/0x250 > handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34 > generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 > gic_handle_irq from generic_handle_arch_irq+0x34/0x44 > generic_handle_arch_irq from __irq_svc+0x8c/0xd0 > Exception stack(0xc1301f10 to 0xc1301f58) > ... > __irq_svc from default_idle_call+0x20/0x2c4 > default_idle_call from do_idle+0x244/0x2b4 > do_idle from cpu_startup_entry+0x28/0x2c > cpu_startup_entry from rest_init+0xec/0x190 > rest_init from arch_post_acpi_subsys_init+0x0/0x8 > Code: e591300c e593402c f57ff04f e591300c (e593903c) > ---[ end trace 0000000000000000 ]--- > Kernel panic - not syncing: Fatal exception in interrupt > CPU2: stopping > > > It looks that the panfrost interrupts must be somehow synchronized with > turning power off, what has been already discussed. Let me know if you > want me to test any patch. > The new series containing the whole interrupts sync code is almost ready, currently testing it on my machines here. I should be able to send it between today and tomorrow. Cheers, Angelo
On 22/11/2023 10:29, Krzysztof Kozlowski wrote: > On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote: > >>>>>> >>>>> >>>>> Hey Krzysztof, >>>>> >>>>> This is interesting. It might be about the cores that are missing from the partial >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to >>>>> see here. >>>> >>>> I've seen such external aborts in the past, and the fault type has >>>> often been misleading. It's unlikely to have anything to do with a >>> >>> Yeah, often accessing device with power or clocks gated. >>> >> >> Except my commit does *not* gate SoC power, nor SoC clocks 🙂 > > It could be that something (like clocks or power supplies) was missing > on this board/SoC, which was not critical till your patch came. > >> >> What the "Really power off ..." commit does is to ask the GPU to internally power >> off the shaders, tilers and L2, that's why I say that it is strange to see that >> kind of abort. >> >> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and GPU_FAULT_ADDRESS_{HI/LO} >> registers should still be accessible even with shaders, tilers and cache OFF. >> >> Anyway, yes, synchronizing IRQs before calling the poweroff sequence would also >> work, but that'd add up quite a bit of latency on the runtime_suspend() call, so >> in this case I'd be more for avoiding to execute any register r/w in the handler >> by either checking if the GPU is supposed to be OFF, or clearing interrupts, which >> may not work if those are generated after the execution of the poweroff function. >> Or we could simply disable the irq after power_off, but that'd be hacky (as well). >> >> >> Let's see if asking to poweroff *everything* works: > > Worked. Is anything happening with this patchset? linux-next is broken - fails to boot - for three weeks now on my platforms. Best regards, Krzysztof
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c index f0be7e19b13e..fad75b6e543e 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -362,28 +362,38 @@ unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev) return ((u64)hi << 32) | lo; } +static u64 panfrost_get_core_mask(struct panfrost_device *pfdev) +{ + u64 core_mask; + + if (pfdev->features.l2_present == 1) + return U64_MAX; + + /* + * Only support one core group now. + * ~(l2_present - 1) unsets all bits in l2_present except + * the bottom bit. (l2_present - 2) has all the bits in + * the first core group set. AND them together to generate + * a mask of cores in the first core group. + */ + core_mask = ~(pfdev->features.l2_present - 1) & + (pfdev->features.l2_present - 2); + dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n", + hweight64(core_mask), + hweight64(pfdev->features.shader_present)); + + return core_mask; +} + void panfrost_gpu_power_on(struct panfrost_device *pfdev) { int ret; u32 val; - u64 core_mask = U64_MAX; + u64 core_mask; panfrost_gpu_init_quirks(pfdev); + core_mask = panfrost_get_core_mask(pfdev); - if (pfdev->features.l2_present != 1) { - /* - * Only support one core group now. - * ~(l2_present - 1) unsets all bits in l2_present except - * the bottom bit. (l2_present - 2) has all the bits in - * the first core group set. AND them together to generate - * a mask of cores in the first core group. - */ - core_mask = ~(pfdev->features.l2_present - 1) & - (pfdev->features.l2_present - 2); - dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n", - hweight64(core_mask), - hweight64(pfdev->features.shader_present)); - } gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present & core_mask); ret = readl_relaxed_poll_timeout(pfdev->iomem + L2_READY_LO, val, val == (pfdev->features.l2_present & core_mask), @@ -408,11 +418,30 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev) void panfrost_gpu_power_off(struct panfrost_device *pfdev) { - gpu_write(pfdev, TILER_PWROFF_LO, 0); - gpu_write(pfdev, SHADER_PWROFF_LO, 0); - gpu_write(pfdev, L2_PWROFF_LO, 0); + u64 core_mask = panfrost_get_core_mask(pfdev); + int ret; + u32 val; + + gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask); + ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO, + val, !val, 1, 1000); + if (ret) + dev_err(pfdev->dev, "shader power transition timeout"); + + gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present); + ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO, + val, !val, 1, 1000); + if (ret) + dev_err(pfdev->dev, "tiler power transition timeout"); + + gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask); + ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO, + val, !val, 0, 1000); + if (ret) + dev_err(pfdev->dev, "l2 power transition timeout"); } + int panfrost_gpu_init(struct panfrost_device *pfdev) { int err, irq;