[1/6] dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW

Message ID 20230511150539.1.Iabe67a827e206496efec6beb5616d5a3b99c1e65@changeid
State New
Headers
Series irqchip/gic-v3: Disable pseudo NMIs on Mediatek Chromebooks w/ bad FW |

Commit Message

Doug Anderson May 11, 2023, 10:05 p.m. UTC
  When trying to turn on the "pseudo NMI" kernel feature in Linux, it
was discovered that all Mediatek-based Chromebooks that ever shipped
(at least ones with GICv3) had a firmware bug where they wouldn't save
certain GIC "GICR" registers properly. If a processor ever entered a
suspend/idle mode where the GICR registers lost state then they'd be
reset to their default state.

As a result of the bug, if you try to enable "pseudo NMIs" on the
affected devices then certain interrupts will unexpectedly get
promoted to be "pseudo NMIs" and cause crashes / freezes / general
mayhem.

ChromeOS is looking to start turning on "pseudo NMIs" in production to
make crash reports more actionable. To do so, we will release firmware
updates for at least some of the affected Mediatek Chromebooks.
However, even when we update the firmware of a Chromebook it's always
possible that a user will end up booting with old firmware. We need to
be able to detect when we're running with firmware that will crash and
burn if pseudo NMIs are enabled.

The current plan is:
* Update the device trees of all affected Chromebooks to include the
  'mediatek,gicr-save-quirk' property. The kernel can use this to know
  not to enable certain features like "pseudo NMI". NOTE: device trees
  for Chromebooks are never baked into the firmware but are bundled
  with the kernel. A kernel will never be configured to use "pseudo
  NMIs" and be bundled with an old device tree.
* When we get a fixed firmware for one of these Chromebooks, it will
  patch the device tree to remove this property.

For some details, you can also see the public bug
<https://issuetracker.google.com/281831288>

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 .../bindings/interrupt-controller/arm,gic-v3.yaml           | 6 ++++++
 1 file changed, 6 insertions(+)
  

Comments

Julius Werner May 11, 2023, 10:37 p.m. UTC | #1
Reviewed-by: Julius Werner <jwerner@chromium.org>
  
Marc Zyngier May 12, 2023, 8:02 a.m. UTC | #2
On Thu, 11 May 2023 23:05:35 +0100,
Douglas Anderson <dianders@chromium.org> wrote:
> 
> When trying to turn on the "pseudo NMI" kernel feature in Linux, it
> was discovered that all Mediatek-based Chromebooks that ever shipped
> (at least ones with GICv3) had a firmware bug where they wouldn't save
> certain GIC "GICR" registers properly. If a processor ever entered a
> suspend/idle mode where the GICR registers lost state then they'd be
> reset to their default state.
> 
> As a result of the bug, if you try to enable "pseudo NMIs" on the
> affected devices then certain interrupts will unexpectedly get
> promoted to be "pseudo NMIs" and cause crashes / freezes / general
> mayhem.
> 
> ChromeOS is looking to start turning on "pseudo NMIs" in production to
> make crash reports more actionable. To do so, we will release firmware
> updates for at least some of the affected Mediatek Chromebooks.
> However, even when we update the firmware of a Chromebook it's always
> possible that a user will end up booting with old firmware. We need to
> be able to detect when we're running with firmware that will crash and
> burn if pseudo NMIs are enabled.
> 
> The current plan is:
> * Update the device trees of all affected Chromebooks to include the
>   'mediatek,gicr-save-quirk' property. The kernel can use this to know
>   not to enable certain features like "pseudo NMI". NOTE: device trees
>   for Chromebooks are never baked into the firmware but are bundled
>   with the kernel. A kernel will never be configured to use "pseudo
>   NMIs" and be bundled with an old device tree.
> * When we get a fixed firmware for one of these Chromebooks, it will
>   patch the device tree to remove this property.

Since you're in control of distributing the FW together with the
kernel, I assume you're also in control of the command line. Why can't
that firmware pass the option enabling the pseudo-NMI support,
dispensing ourselves from all of this?

> 
> For some details, you can also see the public bug
> <https://issuetracker.google.com/281831288>
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>  .../bindings/interrupt-controller/arm,gic-v3.yaml           | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> index 92117261e1e1..8c251caae537 100644
> --- a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> +++ b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> @@ -166,6 +166,12 @@ properties:
>    resets:
>      maxItems: 1
>  
> +  mediatek,gicr-save-quirk:

I think this deserves something *much* stronger that outlines what is
wrong, because this is not just a quirk. This is a failure to even
remotely grasp the requirements of the architecture (and to use
standard, public code that would have done it correctly). Something
like "mediatek,broken-save-restore-fw" would be more adequate.

> +    type: boolean
> +    description:
> +      Asserts that the firmware on this device has issues saving and restoring
> +      GICR registers when CPUs are powered off.

Nit: not the the CPUs, but the GIC redistributors.

Thanks,

	M.
  
Matthias Brugger May 12, 2023, 11:32 a.m. UTC | #3
Hi Julius,

On 12/05/2023 00:37, Julius Werner wrote:
> Reviewed-by: Julius Werner <jwerner@chromium.org>

For the future please don't delete the patch itself if you give a tag, just 
leave it as it is (and don't top-post your tag :) )

Regards,
Matthias
  
Doug Anderson May 12, 2023, 1:40 p.m. UTC | #4
Hi,

On Fri, May 12, 2023 at 1:02 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 11 May 2023 23:05:35 +0100,
> Douglas Anderson <dianders@chromium.org> wrote:
> >
> > When trying to turn on the "pseudo NMI" kernel feature in Linux, it
> > was discovered that all Mediatek-based Chromebooks that ever shipped
> > (at least ones with GICv3) had a firmware bug where they wouldn't save
> > certain GIC "GICR" registers properly. If a processor ever entered a
> > suspend/idle mode where the GICR registers lost state then they'd be
> > reset to their default state.
> >
> > As a result of the bug, if you try to enable "pseudo NMIs" on the
> > affected devices then certain interrupts will unexpectedly get
> > promoted to be "pseudo NMIs" and cause crashes / freezes / general
> > mayhem.
> >
> > ChromeOS is looking to start turning on "pseudo NMIs" in production to
> > make crash reports more actionable. To do so, we will release firmware
> > updates for at least some of the affected Mediatek Chromebooks.
> > However, even when we update the firmware of a Chromebook it's always
> > possible that a user will end up booting with old firmware. We need to
> > be able to detect when we're running with firmware that will crash and
> > burn if pseudo NMIs are enabled.
> >
> > The current plan is:
> > * Update the device trees of all affected Chromebooks to include the
> >   'mediatek,gicr-save-quirk' property. The kernel can use this to know
> >   not to enable certain features like "pseudo NMI". NOTE: device trees
> >   for Chromebooks are never baked into the firmware but are bundled
> >   with the kernel. A kernel will never be configured to use "pseudo
> >   NMIs" and be bundled with an old device tree.
> > * When we get a fixed firmware for one of these Chromebooks, it will
> >   patch the device tree to remove this property.
>
> Since you're in control of distributing the FW together with the
> kernel, I assume you're also in control of the command line. Why can't
> that firmware pass the option enabling the pseudo-NMI support,
> dispensing ourselves from all of this?

I considered the option, but it gets really awkward. Specifically:

1. We can't have old firmwares "take away" the kernel command line
option enabling pseudoNMI, obviously.

2. Having new firmware inject `irqchip.gicv3_pseudo_nmi=1` into the
kernel command line feels really awkward from an abstraction point of
view. This bakes into the firmware an implementation detail about how
something is implemented / enabled in the kernel, which feels wrong.
In general I'm not a fan of needing the kernel command line argument
to begin with and I'd hope that eventually it could go away.

3. Having the firmware inject `irqchip.gicv3_pseudo_nmi=1` into the
kernel command line also makes it awkward when you consider
non-affected boards. On Qualcomm boards, Rockchip boards, and x86
boards we want to enable pseudo NMI without needing to spin the
firmware to have it add this option. ...so we have to add this option
to the "default" command line for every board _except_ affected
Mediatek boards? Is this weird convention something that our build
system carries as we move to newer Mediatek SoCs that aren't affected?
What if we want a single build for Mediatek and Qulcomm boards (which
is something that has been bandied about, even if we haven't done it
yet).

Considering all the above, I think trying to use the
`irqchip.gicv3_pseudo_nmi=1` like this wouldn't fly.


> > For some details, you can also see the public bug
> > <https://issuetracker.google.com/281831288>
> >
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> >  .../bindings/interrupt-controller/arm,gic-v3.yaml           | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> > index 92117261e1e1..8c251caae537 100644
> > --- a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> > +++ b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
> > @@ -166,6 +166,12 @@ properties:
> >    resets:
> >      maxItems: 1
> >
> > +  mediatek,gicr-save-quirk:
>
> I think this deserves something *much* stronger that outlines what is
> wrong, because this is not just a quirk. This is a failure to even
> remotely grasp the requirements of the architecture (and to use
> standard, public code that would have done it correctly). Something
> like "mediatek,broken-save-restore-fw" would be more adequate.

Sure, I'll rename it in the next spin. Unless we get into a big
discussion somewhere else in this patch, I'll plan that for early next
week.


> > +    type: boolean
> > +    description:
> > +      Asserts that the firmware on this device has issues saving and restoring
> > +      GICR registers when CPUs are powered off.
>
> Nit: not the the CPUs, but the GIC redistributors.

Ah, good point. I'll update.
  

Patch

diff --git a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
index 92117261e1e1..8c251caae537 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
@@ -166,6 +166,12 @@  properties:
   resets:
     maxItems: 1
 
+  mediatek,gicr-save-quirk:
+    type: boolean
+    description:
+      Asserts that the firmware on this device has issues saving and restoring
+      GICR registers when CPUs are powered off.
+
 dependencies:
   mbi-ranges: [ msi-controller ]
   msi-controller: [ mbi-ranges ]