[1/2] arm64: smp: Fix pseudo NMI issues w/ broken Mediatek FW

Message ID 20231002094526.1.Ie8f760213053e3d11592f892b30912dbac6b8b48@changeid
State New
Headers
Series [1/2] arm64: smp: Fix pseudo NMI issues w/ broken Mediatek FW |

Commit Message

Doug Anderson Oct. 2, 2023, 4:45 p.m. UTC
  Some mediatek devices have the property
"mediatek,broken-save-restore-fw" in their GIC. This means that,
although the hardware supports pseudo-NMI, the firmware has a bug
that blocks enabling it. When we're in this state,
system_uses_irq_prio_masking() will return true but we'll fail to
actually enable the IRQ in the GIC.

Let's make the code handle this. We'll detect that we failed to
request an IPI as NMI and fallback to requesting it normally. Though
we expect that either all of our requests will fail or all will
succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
robust.

Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Reported-by: Chen-Yu Tsai <wenst@chromium.org>
Closes: https://issuetracker.google.com/issues/197061987#comment68
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/kernel/smp.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)
  

Comments

Doug Anderson Oct. 2, 2023, 7:16 p.m. UTC | #1
Hi,

On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > Some mediatek devices have the property
> > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > although the hardware supports pseudo-NMI, the firmware has a bug
> > that blocks enabling it. When we're in this state,
> > system_uses_irq_prio_masking() will return true but we'll fail to
> > actually enable the IRQ in the GIC.
> >
> > Let's make the code handle this. We'll detect that we failed to
> > request an IPI as NMI and fallback to requesting it normally. Though
> > we expect that either all of our requests will fail or all will
> > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > robust.
> >
> > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
>
> I'm not too keen on falling back here when we have no idea why the request failed.
>
> I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> account for the case of broken FW, e.g. as below.
>
> Mark.
>
> ---->8----
> From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Mon, 2 Oct 2023 18:00:36 +0100
> Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
>
> Some MediaTek devices have broken firmware which corrupts some GICR
> registers behind the back of the OS, and pseudo-NMIs cannot be used on
> these devices. For more details see commit:
>
>   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
>
> We did not take this problem into account in commit:
>
>   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
>
> Since that commit arm64's SMP code will try to setup some IPIs as
> pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> (rightly) reject attempts to request interrupts as pseudo-NMIs,
> resulting in boot-time failures.
>
> Avoid the problem by taking the broken FW into account when deciding to
> request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> named "supports_pseudo_nmis" which is false on systems with broken FW,
> and we can consult this within ipi_should_be_nmi().
>
> Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> Closes: https://issuetracker.google.com/issues/197061987#comment68
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kernel/smp.c      | 5 ++++-
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)

Sure, this is OK w/ me as long as folks don't mind accessing the
global here, it's OK w/ me:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

It seems to work for me, thus:

Tested-by: Douglas Anderson <dianders@chromium.org>


> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 814d9aa93b21b..061c69160f90f 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
>
>  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
>  {
> -       if (!system_uses_irq_prio_masking())
> +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> +
> +       if (!system_uses_irq_prio_masking() ||
> +           !static_branch_likely(&supports_pseudo_nmis))

One thought, actually, is whether we should actually change
system_uses_irq_prio_masking() to return the correct value. What do
you think?

-Doug
  
Mark Rutland Oct. 3, 2023, 12:29 p.m. UTC | #2
On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > Some mediatek devices have the property
> > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > that blocks enabling it. When we're in this state,
> > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > actually enable the IRQ in the GIC.
> > >
> > > Let's make the code handle this. We'll detect that we failed to
> > > request an IPI as NMI and fallback to requesting it normally. Though
> > > we expect that either all of our requests will fail or all will
> > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > robust.
> > >
> > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > ---
> > >
> > >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > >  1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > I'm not too keen on falling back here when we have no idea why the request failed.
> >
> > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > account for the case of broken FW, e.g. as below.
> >
> > Mark.
> >
> > ---->8----
> > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > From: Mark Rutland <mark.rutland@arm.com>
> > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> >
> > Some MediaTek devices have broken firmware which corrupts some GICR
> > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > these devices. For more details see commit:
> >
> >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> >
> > We did not take this problem into account in commit:
> >
> >   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> >
> > Since that commit arm64's SMP code will try to setup some IPIs as
> > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > resulting in boot-time failures.
> >
> > Avoid the problem by taking the broken FW into account when deciding to
> > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > and we can consult this within ipi_should_be_nmi().
> >
> > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Douglas Anderson <dianders@chromium.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kernel/smp.c      | 5 ++++-
> >  drivers/irqchip/irq-gic-v3.c | 2 +-
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> Sure, this is OK w/ me as long as folks don't mind accessing the
> global here, it's OK w/ me:
> 
> Reviewed-by: Douglas Anderson <dianders@chromium.org>
> 
> It seems to work for me, thus:
> 
> Tested-by: Douglas Anderson <dianders@chromium.org>
> 
> 
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 814d9aa93b21b..061c69160f90f 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> >
> >  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> >  {
> > -       if (!system_uses_irq_prio_masking())
> > +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > +
> > +       if (!system_uses_irq_prio_masking() ||
> > +           !static_branch_likely(&supports_pseudo_nmis))
> 
> One thought, actually, is whether we should actually change
> system_uses_irq_prio_masking() to return the correct value. What do
> you think?

I don't think we should add this to system_uses_irq_prio_masking(); that's used
by the low-level flags manipulation code that gets inlined all over the place,
and that code will work regarldess of whether we actually use NMI priorities.

If we want to avoid using PMR masking *at all* on these platforms, we'd need to
detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().

Thanks,
Mark.
  
Doug Anderson Oct. 3, 2023, 1:43 p.m. UTC | #3
Hi,

On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > > Some mediatek devices have the property
> > > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > > that blocks enabling it. When we're in this state,
> > > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > > actually enable the IRQ in the GIC.
> > > >
> > > > Let's make the code handle this. We'll detect that we failed to
> > > > request an IPI as NMI and fallback to requesting it normally. Though
> > > > we expect that either all of our requests will fail or all will
> > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > > robust.
> > > >
> > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > > ---
> > > >
> > > >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > > >  1 file changed, 12 insertions(+), 7 deletions(-)
> > >
> > > I'm not too keen on falling back here when we have no idea why the request failed.
> > >
> > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > > account for the case of broken FW, e.g. as below.
> > >
> > > Mark.
> > >
> > > ---->8----
> > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > > From: Mark Rutland <mark.rutland@arm.com>
> > > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> > >
> > > Some MediaTek devices have broken firmware which corrupts some GICR
> > > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > > these devices. For more details see commit:
> > >
> > >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> > >
> > > We did not take this problem into account in commit:
> > >
> > >   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > >
> > > Since that commit arm64's SMP code will try to setup some IPIs as
> > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > > resulting in boot-time failures.
> > >
> > > Avoid the problem by taking the broken FW into account when deciding to
> > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > > and we can consult this within ipi_should_be_nmi().
> > >
> > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > > Cc: Douglas Anderson <dianders@chromium.org>
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > ---
> > >  arch/arm64/kernel/smp.c      | 5 ++++-
> > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > Sure, this is OK w/ me as long as folks don't mind accessing the
> > global here, it's OK w/ me:
> >
> > Reviewed-by: Douglas Anderson <dianders@chromium.org>
> >
> > It seems to work for me, thus:
> >
> > Tested-by: Douglas Anderson <dianders@chromium.org>
> >
> >
> > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > index 814d9aa93b21b..061c69160f90f 100644
> > > --- a/arch/arm64/kernel/smp.c
> > > +++ b/arch/arm64/kernel/smp.c
> > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> > >
> > >  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> > >  {
> > > -       if (!system_uses_irq_prio_masking())
> > > +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > > +
> > > +       if (!system_uses_irq_prio_masking() ||
> > > +           !static_branch_likely(&supports_pseudo_nmis))
> >
> > One thought, actually, is whether we should actually change
> > system_uses_irq_prio_masking() to return the correct value. What do
> > you think?
>
> I don't think we should add this to system_uses_irq_prio_masking(); that's used
> by the low-level flags manipulation code that gets inlined all over the place,
> and that code will work regarldess of whether we actually use NMI priorities.
>
> If we want to avoid using PMR masking *at all* on these platforms, we'd need to
> detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().

I suspect that anyone trying to use PMR masking on these systems for
any purpose will be unhappy. The issue is talked about in:

https://issuetracker.google.com/281831288

...where you can see that the firmware on these systems isn't properly
saving/restoring some registers, including GICR_IPRIORITYR.

-Doug
  
Mark Rutland Oct. 3, 2023, 4:32 p.m. UTC | #4
On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> > > Hi,
> > >
> > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > > > Some mediatek devices have the property
> > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > > > that blocks enabling it. When we're in this state,
> > > > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > > > actually enable the IRQ in the GIC.
> > > > >
> > > > > Let's make the code handle this. We'll detect that we failed to
> > > > > request an IPI as NMI and fallback to requesting it normally. Though
> > > > > we expect that either all of our requests will fail or all will
> > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > > > robust.
> > > > >
> > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > > > ---
> > > > >
> > > > >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > > > >  1 file changed, 12 insertions(+), 7 deletions(-)
> > > >
> > > > I'm not too keen on falling back here when we have no idea why the request failed.
> > > >
> > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > > > account for the case of broken FW, e.g. as below.
> > > >
> > > > Mark.
> > > >
> > > > ---->8----
> > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > > > From: Mark Rutland <mark.rutland@arm.com>
> > > > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> > > >
> > > > Some MediaTek devices have broken firmware which corrupts some GICR
> > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > > > these devices. For more details see commit:
> > > >
> > > >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> > > >
> > > > We did not take this problem into account in commit:
> > > >
> > > >   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > >
> > > > Since that commit arm64's SMP code will try to setup some IPIs as
> > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > > > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > > > resulting in boot-time failures.
> > > >
> > > > Avoid the problem by taking the broken FW into account when deciding to
> > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > > > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > > > and we can consult this within ipi_should_be_nmi().
> > > >
> > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > > > Cc: Douglas Anderson <dianders@chromium.org>
> > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > ---
> > > >  arch/arm64/kernel/smp.c      | 5 ++++-
> > > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > >
> > > Sure, this is OK w/ me as long as folks don't mind accessing the
> > > global here, it's OK w/ me:
> > >
> > > Reviewed-by: Douglas Anderson <dianders@chromium.org>
> > >
> > > It seems to work for me, thus:
> > >
> > > Tested-by: Douglas Anderson <dianders@chromium.org>
> > >
> > >
> > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > index 814d9aa93b21b..061c69160f90f 100644
> > > > --- a/arch/arm64/kernel/smp.c
> > > > +++ b/arch/arm64/kernel/smp.c
> > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> > > >
> > > >  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> > > >  {
> > > > -       if (!system_uses_irq_prio_masking())
> > > > +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > > > +
> > > > +       if (!system_uses_irq_prio_masking() ||
> > > > +           !static_branch_likely(&supports_pseudo_nmis))
> > >
> > > One thought, actually, is whether we should actually change
> > > system_uses_irq_prio_masking() to return the correct value. What do
> > > you think?
> >
> > I don't think we should add this to system_uses_irq_prio_masking(); that's used
> > by the low-level flags manipulation code that gets inlined all over the place,
> > and that code will work regarldess of whether we actually use NMI priorities.
> >
> > If we want to avoid using PMR masking *at all* on these platforms, we'd need to
> > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().
> 
> I suspect that anyone trying to use PMR masking on these systems for
> any purpose will be unhappy. The issue is talked about in:
> 
> https://issuetracker.google.com/281831288
> 
> ...where you can see that the firmware on these systems isn't properly
> saving/restoring some registers, including GICR_IPRIORITYR.

Ok, then that's a latent bug even before the IPI changes, going back to the
original workaround in commit:

  44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")

For the sake of those reading the archive, can we have a better description of
what exactly happens on these boards?

IIUC on these boards the firmware fails to save+restore (some?) GICR registers
across (some?) PSCI CPU_SUSPEND idle states.

Which registers does it save+restore?

Does it reset other registers into a specific state?

Thanks,
Mark.
  
Doug Anderson Oct. 3, 2023, 7:32 p.m. UTC | #5
Hi,

On Tue, Oct 3, 2023 at 9:32 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> > > > Hi,
> > > >
> > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > > >
> > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > > > > Some mediatek devices have the property
> > > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > > > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > > > > that blocks enabling it. When we're in this state,
> > > > > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > > > > actually enable the IRQ in the GIC.
> > > > > >
> > > > > > Let's make the code handle this. We'll detect that we failed to
> > > > > > request an IPI as NMI and fallback to requesting it normally. Though
> > > > > > we expect that either all of our requests will fail or all will
> > > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > > > > robust.
> > > > > >
> > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > > > > ---
> > > > > >
> > > > > >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > > > > >  1 file changed, 12 insertions(+), 7 deletions(-)
> > > > >
> > > > > I'm not too keen on falling back here when we have no idea why the request failed.
> > > > >
> > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > > > > account for the case of broken FW, e.g. as below.
> > > > >
> > > > > Mark.
> > > > >
> > > > > ---->8----
> > > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > > > > From: Mark Rutland <mark.rutland@arm.com>
> > > > > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> > > > >
> > > > > Some MediaTek devices have broken firmware which corrupts some GICR
> > > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > > > > these devices. For more details see commit:
> > > > >
> > > > >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> > > > >
> > > > > We did not take this problem into account in commit:
> > > > >
> > > > >   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > >
> > > > > Since that commit arm64's SMP code will try to setup some IPIs as
> > > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > > > > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > > > > resulting in boot-time failures.
> > > > >
> > > > > Avoid the problem by taking the broken FW into account when deciding to
> > > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > > > > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > > > > and we can consult this within ipi_should_be_nmi().
> > > > >
> > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > > > > Cc: Douglas Anderson <dianders@chromium.org>
> > > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > > ---
> > > > >  arch/arm64/kernel/smp.c      | 5 ++++-
> > > > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > Sure, this is OK w/ me as long as folks don't mind accessing the
> > > > global here, it's OK w/ me:
> > > >
> > > > Reviewed-by: Douglas Anderson <dianders@chromium.org>
> > > >
> > > > It seems to work for me, thus:
> > > >
> > > > Tested-by: Douglas Anderson <dianders@chromium.org>
> > > >
> > > >
> > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > index 814d9aa93b21b..061c69160f90f 100644
> > > > > --- a/arch/arm64/kernel/smp.c
> > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> > > > >
> > > > >  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> > > > >  {
> > > > > -       if (!system_uses_irq_prio_masking())
> > > > > +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > > > > +
> > > > > +       if (!system_uses_irq_prio_masking() ||
> > > > > +           !static_branch_likely(&supports_pseudo_nmis))
> > > >
> > > > One thought, actually, is whether we should actually change
> > > > system_uses_irq_prio_masking() to return the correct value. What do
> > > > you think?
> > >
> > > I don't think we should add this to system_uses_irq_prio_masking(); that's used
> > > by the low-level flags manipulation code that gets inlined all over the place,
> > > and that code will work regarldess of whether we actually use NMI priorities.
> > >
> > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to
> > > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().
> >
> > I suspect that anyone trying to use PMR masking on these systems for
> > any purpose will be unhappy. The issue is talked about in:
> >
> > https://issuetracker.google.com/281831288
> >
> > ...where you can see that the firmware on these systems isn't properly
> > saving/restoring some registers, including GICR_IPRIORITYR.
>
> Ok, then that's a latent bug even before the IPI changes, going back to the
> original workaround in commit:
>
>   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
>
> For the sake of those reading the archive, can we have a better description of
> what exactly happens on these boards?
>
> IIUC on these boards the firmware fails to save+restore (some?) GICR registers
> across (some?) PSCI CPU_SUSPEND idle states.
>
> Which registers does it save+restore?
>
> Does it reset other registers into a specific state?

Though I'm not an expert in this area, my understanding is that in
some of the deeper idle states then GICR registers are lost. That
matches a thread [0] I found. In early investigation I found that I
could comment out `cpu-idle-states` in the device tree to avoid the
problems [1]. I believe this is fully expected which is why firmware
is supposed to save/restore these registers whenever a low power is
entered/exited. I'd presume that any register not properly
saved/restored comes up in whatever its default state is.

As far as pseudo-NMI was concerned, all I really needed to
save/restore was "GICR_NUM_IPRIORITYR" [2], but Marc Zyngier looked at
the code and identified [3] at least these in addition:
* GICR_CTLR
* GICR_ISPENDR0
* GICR_ISACTIVER0
* GICR_NSACR

That list seems to match the Arm Trusted Firmware patch that fixed the
issue [4]. ...but it will be impossible to ever get the fix rolled out
to all devices. Even if we could get firmware spins Qualified for
every device there will still be cases where we'll boot with the old
firmware. Since we _don't_ bundle the device tree with the firmware,
we believe that the quirk mechanism that we came up with (add a quirk
in never device trees and firmware removes the quirk when we have a
fix) is at least a robust/reliable way to detect the issue.

The whole issue seems rather concerning, but (apparently) it never
caused issues in the kernel until we tried to use pseudo-NMI.

[0] https://github.com/ARM-software/tf-issues/issues/464
[1] https://issuetracker.google.com/issues/197061987#comment27
[2] https://crrev.com/c/4519877
[3] https://issuetracker.google.com/issues/281831288#comment5
[4] https://github.com/ARM-software/arm-trusted-firmware/commit/1c62cc7fbdf2ec2a7e69b3c027d507fcafdcaa12
  
Mark Rutland Oct. 4, 2023, 9:59 a.m. UTC | #6
On Tue, Oct 03, 2023 at 12:32:39PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Tue, Oct 3, 2023 at 9:32 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Oct 03, 2023 at 06:43:07AM -0700, Doug Anderson wrote:
> > > Hi,
> > >
> > > On Tue, Oct 3, 2023 at 5:29 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> > > > > Hi,
> > > > >
> > > > > On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > > > >
> > > > > > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > > > > > Some mediatek devices have the property
> > > > > > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > > > > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > > > > > that blocks enabling it. When we're in this state,
> > > > > > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > > > > > actually enable the IRQ in the GIC.
> > > > > > >
> > > > > > > Let's make the code handle this. We'll detect that we failed to
> > > > > > > request an IPI as NMI and fallback to requesting it normally. Though
> > > > > > > we expect that either all of our requests will fail or all will
> > > > > > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > > > > > robust.
> > > > > > >
> > > > > > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > > > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > > > > > ---
> > > > > > >
> > > > > > >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > > > > > >  1 file changed, 12 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > I'm not too keen on falling back here when we have no idea why the request failed.
> > > > > >
> > > > > > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > > > > > account for the case of broken FW, e.g. as below.
> > > > > >
> > > > > > Mark.
> > > > > >
> > > > > > ---->8----
> > > > > > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > > > > > From: Mark Rutland <mark.rutland@arm.com>
> > > > > > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > > > > > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> > > > > >
> > > > > > Some MediaTek devices have broken firmware which corrupts some GICR
> > > > > > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > > > > > these devices. For more details see commit:
> > > > > >
> > > > > >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> > > > > >
> > > > > > We did not take this problem into account in commit:
> > > > > >
> > > > > >   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > >
> > > > > > Since that commit arm64's SMP code will try to setup some IPIs as
> > > > > > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > > > > > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > > > > > resulting in boot-time failures.
> > > > > >
> > > > > > Avoid the problem by taking the broken FW into account when deciding to
> > > > > > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > > > > > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > > > > > and we can consult this within ipi_should_be_nmi().
> > > > > >
> > > > > > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > > > > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > > > > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > > > > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > > > > > Cc: Douglas Anderson <dianders@chromium.org>
> > > > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > > > ---
> > > > > >  arch/arm64/kernel/smp.c      | 5 ++++-
> > > > > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > > > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > Sure, this is OK w/ me as long as folks don't mind accessing the
> > > > > global here, it's OK w/ me:
> > > > >
> > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org>
> > > > >
> > > > > It seems to work for me, thus:
> > > > >
> > > > > Tested-by: Douglas Anderson <dianders@chromium.org>
> > > > >
> > > > >
> > > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > > index 814d9aa93b21b..061c69160f90f 100644
> > > > > > --- a/arch/arm64/kernel/smp.c
> > > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> > > > > >
> > > > > >  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> > > > > >  {
> > > > > > -       if (!system_uses_irq_prio_masking())
> > > > > > +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > > > > > +
> > > > > > +       if (!system_uses_irq_prio_masking() ||
> > > > > > +           !static_branch_likely(&supports_pseudo_nmis))
> > > > >
> > > > > One thought, actually, is whether we should actually change
> > > > > system_uses_irq_prio_masking() to return the correct value. What do
> > > > > you think?
> > > >
> > > > I don't think we should add this to system_uses_irq_prio_masking(); that's used
> > > > by the low-level flags manipulation code that gets inlined all over the place,
> > > > and that code will work regarldess of whether we actually use NMI priorities.
> > > >
> > > > If we want to avoid using PMR masking *at all* on these platforms, we'd need to
> > > > detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().
> > >
> > > I suspect that anyone trying to use PMR masking on these systems for
> > > any purpose will be unhappy. The issue is talked about in:
> > >
> > > https://issuetracker.google.com/281831288
> > >
> > > ...where you can see that the firmware on these systems isn't properly
> > > saving/restoring some registers, including GICR_IPRIORITYR.
> >
> > Ok, then that's a latent bug even before the IPI changes, going back to the
> > original workaround in commit:
> >
> >   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> >
> > For the sake of those reading the archive, can we have a better description of
> > what exactly happens on these boards?
> >
> > IIUC on these boards the firmware fails to save+restore (some?) GICR registers
> > across (some?) PSCI CPU_SUSPEND idle states.
> >
> > Which registers does it save+restore?
> >
> > Does it reset other registers into a specific state?
> 
> Though I'm not an expert in this area, my understanding is that in
> some of the deeper idle states then GICR registers are lost. That
> matches a thread [0] I found. In early investigation I found that I
> could comment out `cpu-idle-states` in the device tree to avoid the
> problems [1]. I believe this is fully expected which is why firmware
> is supposed to save/restore these registers whenever a low power is
> entered/exited. I'd presume that any register not properly
> saved/restored comes up in whatever its default state is.
> 
> As far as pseudo-NMI was concerned, all I really needed to
> save/restore was "GICR_NUM_IPRIORITYR" [2], but Marc Zyngier looked at
> the code and identified [3] at least these in addition:
> * GICR_CTLR
> * GICR_ISPENDR0
> * GICR_ISACTIVER0
> * GICR_NSACR

Looking at the GIC spec (Arm IHI 0069H), page 12-673, I see for all of the
GICR_IPRIORITYR<n>.Priority_offset_*B fields:

| The reset behavior of this field is:
| • On a GIC reset, this field resets to an architecturally UNKNOWN value.

... so at least per the architecture these could be reset to arbitrary values,

and that priority might permit SGI/PPIs to be taken IRQs are priority-masked,
or to prevent SGI/PPIs to be taken when priority-unmasked.

I also see for GICR_CTLR that EnableLPIs would be reset to 0, and IIUC that
means LPIs won't work on these parts, which seems like a problem.

> That list seems to match the Arm Trusted Firmware patch that fixed the
> issue [4]. ...but it will be impossible to ever get the fix rolled out
> to all devices. Even if we could get firmware spins Qualified for
> every device there will still be cases where we'll boot with the old
> firmware. Since we _don't_ bundle the device tree with the firmware,
> we believe that the quirk mechanism that we came up with (add a quirk
> in never device trees and firmware removes the quirk when we have a
> fix) is at least a robust/reliable way to detect the issue.
> 
> The whole issue seems rather concerning, but (apparently) it never
> caused issues in the kernel until we tried to use pseudo-NMI.

Given you haven't seen any issues, I suspect those are getting reset to fixed
values that happens to work out for us, but it is a bit worrisome more
generally (e.g. the LPI case above).

Mark.

> 
> [0] https://github.com/ARM-software/tf-issues/issues/464
> [1] https://issuetracker.google.com/issues/197061987#comment27
> [2] https://crrev.com/c/4519877
> [3] https://issuetracker.google.com/issues/281831288#comment5
> [4] https://github.com/ARM-software/arm-trusted-firmware/commit/1c62cc7fbdf2ec2a7e69b3c027d507fcafdcaa12
  
Marc Zyngier Oct. 4, 2023, 10:15 a.m. UTC | #7
On Wed, 04 Oct 2023 10:59:50 +0100,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> Given you haven't seen any issues, I suspect those are getting reset to fixed
> values that happens to work out for us, but it is a bit worrisome more
> generally (e.g. the LPI case above).

It is likely that these SoCs don't even have an ITS.

	M.
  
Doug Anderson Oct. 4, 2023, 2:04 p.m. UTC | #8
Hi,

On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Wed, 04 Oct 2023 10:59:50 +0100,
> Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Given you haven't seen any issues, I suspect those are getting reset to fixed
> > values that happens to work out for us, but it is a bit worrisome more
> > generally (e.g. the LPI case above).
>
> It is likely that these SoCs don't even have an ITS.

Right. That was what we decided [1] when Marc pointed this out earlier.

Overall: we know that this firmware behavior is not good but we're
stuck with it. :( At the very least, any new devices coming out will
have this fixed. Presumably if old devices are working OK enough today
(as long as you don't enable pseudo-NMI) then they can be made to keep
working?

So circling back: what patch should we actually land? As of right now
only pseudo-NMI is broken, but it would be good to make sure that if
the kernel later adds other features that would be broken on this
hardware that it gets handled properly...

[1] https://issuetracker.google.com/issues/281831288#comment4
  
Mark Rutland Oct. 5, 2023, 10:27 a.m. UTC | #9
On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote:
> On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Wed, 04 Oct 2023 10:59:50 +0100,
> > Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Given you haven't seen any issues, I suspect those are getting reset to fixed
> > > values that happens to work out for us, but it is a bit worrisome more
> > > generally (e.g. the LPI case above).
> >
> > It is likely that these SoCs don't even have an ITS.
> 
> Right. That was what we decided [1] when Marc pointed this out earlier.
> 
> Overall: we know that this firmware behavior is not good but we're
> stuck with it. :( At the very least, any new devices coming out will
> have this fixed. Presumably if old devices are working OK enough today
> (as long as you don't enable pseudo-NMI) then they can be made to keep
> working?
> 
> So circling back: what patch should we actually land?

For now I'd prefer we took the patch I sent in:

  https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/

... as that leaves us no worse than before this series, and it's pretty simple.

> As of right now only pseudo-NMI is broken, but it would be good to make sure
> that if the kernel later adds other features that would be broken on this
> hardware that it gets handled properly...

Going further than the above, I think there are three options here:

1) Complete fix: depend on a working firmware, and throw this workaround away.

   IIUC from the above, that's not something you can commit to.

2) Partial fix: have the kernel save/restore everything.

   IIUC this is unpalatable.

3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the
   absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe
   we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities().

   That'll avoid potential issues if/when we change the priorities used for
   pNMI (which is something I've been looking at).

I'm happy with (3) if Marc is.

Mark.
  
Doug Anderson Oct. 5, 2023, 3:34 p.m. UTC | #10
Hi,

On Thu, Oct 5, 2023 at 3:27 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote:
> > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Wed, 04 Oct 2023 10:59:50 +0100,
> > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > Given you haven't seen any issues, I suspect those are getting reset to fixed
> > > > values that happens to work out for us, but it is a bit worrisome more
> > > > generally (e.g. the LPI case above).
> > >
> > > It is likely that these SoCs don't even have an ITS.
> >
> > Right. That was what we decided [1] when Marc pointed this out earlier.
> >
> > Overall: we know that this firmware behavior is not good but we're
> > stuck with it. :( At the very least, any new devices coming out will
> > have this fixed. Presumably if old devices are working OK enough today
> > (as long as you don't enable pseudo-NMI) then they can be made to keep
> > working?
> >
> > So circling back: what patch should we actually land?
>
> For now I'd prefer we took the patch I sent in:
>
>   https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/
>
> ... as that leaves us no worse than before this series, and it's pretty simple.

Sounds good to me!

Catalin / Will: Please yell if there's anything you need me to do.
Otherwise I'll assume you'll pick up Mark's patch instead of my patch
#1 and then you'll pick up my patch #2.


> > As of right now only pseudo-NMI is broken, but it would be good to make sure
> > that if the kernel later adds other features that would be broken on this
> > hardware that it gets handled properly...
>
> Going further than the above, I think there are three options here:
>
> 1) Complete fix: depend on a working firmware, and throw this workaround away.
>
>    IIUC from the above, that's not something you can commit to.

Right. We've landed the fix in the firmware branch for many of the
devices that had the issue but there's a whole process (and cost
involved) in getting this actually rolled out. Each unique board needs
to kick off a FW qual. Given that nothing is actually broken today
it's hard to justify a FW qual just for this, so we're left
piggybacking on the next reason for a FW qual (if there is one).
...even if we could kick them off all instantly, though, it's always
best not to rely on a FW fix, especially if (as in this case) it's to
keep us from crashing. There'll always be some case where someone
tries to boot a new OS with an old firmware. One such case can happen
when booting recovery images on ChromeOS where the device always boots
from the Read-only (not updatable) firmware.


> 2) Partial fix: have the kernel save/restore everything.
>
>    IIUC this is unpalatable.

Yeah, Marc already NAKed it.


> 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the
>    absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe
>    we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities().
>
>    That'll avoid potential issues if/when we change the priorities used for
>    pNMI (which is something I've been looking at).
>
> I'm happy with (3) if Marc is.

Yeah, that seems best to me as a long term solution, too.

-Doug
  
Marc Zyngier Oct. 6, 2023, 9:55 a.m. UTC | #11
On Mon, 02 Oct 2023 18:24:11 +0100,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > Some mediatek devices have the property
> > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > although the hardware supports pseudo-NMI, the firmware has a bug
> > that blocks enabling it. When we're in this state,
> > system_uses_irq_prio_masking() will return true but we'll fail to
> > actually enable the IRQ in the GIC.
> > 
> > Let's make the code handle this. We'll detect that we failed to
> > request an IPI as NMI and fallback to requesting it normally. Though
> > we expect that either all of our requests will fail or all will
> > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > robust.
> > 
> > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> > 
> >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> I'm not too keen on falling back here when we have no idea why the request failed.
> 
> I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> account for the case of broken FW, e.g. as below.
> 
> Mark.
> 
> ---->8----
> From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Mon, 2 Oct 2023 18:00:36 +0100
> Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> 
> Some MediaTek devices have broken firmware which corrupts some GICR
> registers behind the back of the OS, and pseudo-NMIs cannot be used on
> these devices. For more details see commit:
> 
>   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> 
> We did not take this problem into account in commit:
> 
>   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> 
> Since that commit arm64's SMP code will try to setup some IPIs as
> pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> (rightly) reject attempts to request interrupts as pseudo-NMIs,
> resulting in boot-time failures.
> 
> Avoid the problem by taking the broken FW into account when deciding to
> request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> named "supports_pseudo_nmis" which is false on systems with broken FW,
> and we can consult this within ipi_should_be_nmi().
> 
> Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> Closes: https://issuetracker.google.com/issues/197061987#comment68
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kernel/smp.c      | 5 ++++-
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 814d9aa93b21b..061c69160f90f 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
>  
>  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
>  {
> -	if (!system_uses_irq_prio_masking())
> +	DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> +
> +	if (!system_uses_irq_prio_masking() ||
> +	    !static_branch_likely(&supports_pseudo_nmis))
>  		return false;
>  
>  	switch (ipi) {
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 787ccc880b22d..737da1b9aabf2 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -106,7 +106,7 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
>   * - Figure 4-7 Secure read of the priority field for a Non-secure Group 1
>   *   interrupt.
>   */
> -static DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> +DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis);
>  
>  DEFINE_STATIC_KEY_FALSE(gic_nonsecure_priorities);
>  EXPORT_SYMBOL(gic_nonsecure_priorities);

This last hunk is going to result in more spam from the robots about
global objects without a previous declaration. Not that I care the
least, but worth mentioning.

Other than that:

Reviewed-by: Marc Zyngier <maz@kernel.org>

Please take it via the arm64 tree with patch #1

Thanks,

	M.
  
Chen-Yu Tsai Oct. 6, 2023, 10:20 a.m. UTC | #12
On Tue, Oct 3, 2023 at 1:24 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > Some mediatek devices have the property
> > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > although the hardware supports pseudo-NMI, the firmware has a bug
> > that blocks enabling it. When we're in this state,
> > system_uses_irq_prio_masking() will return true but we'll fail to
> > actually enable the IRQ in the GIC.
> >
> > Let's make the code handle this. We'll detect that we failed to
> > request an IPI as NMI and fallback to requesting it normally. Though
> > we expect that either all of our requests will fail or all will
> > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > robust.
> >
> > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> >  arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
>
> I'm not too keen on falling back here when we have no idea why the request failed.
>
> I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> account for the case of broken FW, e.g. as below.
>
> Mark.
>
> ---->8----
> From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Mon, 2 Oct 2023 18:00:36 +0100
> Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
>
> Some MediaTek devices have broken firmware which corrupts some GICR
> registers behind the back of the OS, and pseudo-NMIs cannot be used on
> these devices. For more details see commit:
>
>   44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
>
> We did not take this problem into account in commit:
>
>   331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
>
> Since that commit arm64's SMP code will try to setup some IPIs as
> pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> (rightly) reject attempts to request interrupts as pseudo-NMIs,
> resulting in boot-time failures.
>
> Avoid the problem by taking the broken FW into account when deciding to
> request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> named "supports_pseudo_nmis" which is false on systems with broken FW,
> and we can consult this within ipi_should_be_nmi().
>
> Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> Reported-by: Chen-Yu Tsai <wenst@chromium.org>
> Closes: https://issuetracker.google.com/issues/197061987#comment68
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Marc Zyngier <maz@kernel.org>

Tested-by: Chen-Yu Tsai <wenst@chromium.org>

> ---
>  arch/arm64/kernel/smp.c      | 5 ++++-
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 814d9aa93b21b..061c69160f90f 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
>
>  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
>  {
> -       if (!system_uses_irq_prio_masking())
> +       DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> +
> +       if (!system_uses_irq_prio_masking() ||
> +           !static_branch_likely(&supports_pseudo_nmis))
>                 return false;
>
>         switch (ipi) {
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 787ccc880b22d..737da1b9aabf2 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -106,7 +106,7 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
>   * - Figure 4-7 Secure read of the priority field for a Non-secure Group 1
>   *   interrupt.
>   */
> -static DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> +DEFINE_STATIC_KEY_FALSE(supports_pseudo_nmis);
>
>  DEFINE_STATIC_KEY_FALSE(gic_nonsecure_priorities);
>  EXPORT_SYMBOL(gic_nonsecure_priorities);
> --
> 2.30.2
>
>
  
Marc Zyngier Oct. 6, 2023, 10:20 a.m. UTC | #13
On Thu, 05 Oct 2023 11:27:26 +0100,
Mark Rutland <mark.rutland@arm.com> wrote:

> 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the
>    absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe
>    we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities().
> 
>    That'll avoid potential issues if/when we change the priorities used for
>    pNMI (which is something I've been looking at).
> 
> I'm happy with (3) if Marc is.

Definitely worth investigating if you have the bandwidth.

	M.
  
Catalin Marinas Oct. 6, 2023, 12:19 p.m. UTC | #14
On Thu, Oct 05, 2023 at 08:34:56AM -0700, Doug Anderson wrote:
> On Thu, Oct 5, 2023 at 3:27 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote:
> > > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > On Wed, 04 Oct 2023 10:59:50 +0100,
> > > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > > > Given you haven't seen any issues, I suspect those are getting reset to fixed
> > > > > values that happens to work out for us, but it is a bit worrisome more
> > > > > generally (e.g. the LPI case above).
> > > >
> > > > It is likely that these SoCs don't even have an ITS.
> > >
> > > Right. That was what we decided [1] when Marc pointed this out earlier.
> > >
> > > Overall: we know that this firmware behavior is not good but we're
> > > stuck with it. :( At the very least, any new devices coming out will
> > > have this fixed. Presumably if old devices are working OK enough today
> > > (as long as you don't enable pseudo-NMI) then they can be made to keep
> > > working?
> > >
> > > So circling back: what patch should we actually land?
> >
> > For now I'd prefer we took the patch I sent in:
> >
> >   https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@FVFF77S0Q05N.cambridge.arm.com/
> >
> > ... as that leaves us no worse than before this series, and it's pretty simple.
> 
> Sounds good to me!
> 
> Catalin / Will: Please yell if there's anything you need me to do.
> Otherwise I'll assume you'll pick up Mark's patch instead of my patch
> #1 and then you'll pick up my patch #2.

I applied both to the arm64 for-next/backtrace-ipi branch. Thanks.
  
Doug Anderson Oct. 6, 2023, 10:17 p.m. UTC | #15
Hi,

On Fri, Oct 6, 2023 at 3:20 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 05 Oct 2023 11:27:26 +0100,
> Mark Rutland <mark.rutland@arm.com> wrote:
>
> > 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the
> >    absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe
> >    we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities().
> >
> >    That'll avoid potential issues if/when we change the priorities used for
> >    pNMI (which is something I've been looking at).
> >
> > I'm happy with (3) if Marc is.
>
> Definitely worth investigating if you have the bandwidth.

I made an attempt at it. See:

https://lore.kernel.org/20231006151547.1.Ide945748593cffd8ff0feb9ae22b795935b944d6@changeid
  

Patch

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 814d9aa93b21..0a6002243a8c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -87,6 +87,7 @@  enum ipi_msg_type {
 static int ipi_irq_base __ro_after_init;
 static int nr_ipi __ro_after_init = NR_IPI;
 static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
+DECLARE_BITMAP(ipi_is_nmi, MAX_IPI);
 
 static void ipi_setup(int cpu);
 
@@ -986,7 +987,7 @@  static void ipi_setup(int cpu)
 		return;
 
 	for (i = 0; i < nr_ipi; i++) {
-		if (ipi_should_be_nmi(i)) {
+		if (test_bit(i, ipi_is_nmi)) {
 			prepare_percpu_nmi(ipi_irq_base + i);
 			enable_percpu_nmi(ipi_irq_base + i, 0);
 		} else {
@@ -1004,7 +1005,7 @@  static void ipi_teardown(int cpu)
 		return;
 
 	for (i = 0; i < nr_ipi; i++) {
-		if (ipi_should_be_nmi(i)) {
+		if (test_bit(i, ipi_is_nmi)) {
 			disable_percpu_nmi(ipi_irq_base + i);
 			teardown_percpu_nmi(ipi_irq_base + i);
 		} else {
@@ -1022,17 +1023,21 @@  void __init set_smp_ipi_range(int ipi_base, int n)
 	nr_ipi = min(n, MAX_IPI);
 
 	for (i = 0; i < nr_ipi; i++) {
-		int err;
+		int err = -EINVAL;
 
 		if (ipi_should_be_nmi(i)) {
 			err = request_percpu_nmi(ipi_base + i, ipi_handler,
 						 "IPI", &cpu_number);
-			WARN(err, "Could not request IPI %d as NMI, err=%d\n",
-			     i, err);
-		} else {
+			if (err)
+				pr_info_once("NMI unavailable; fallback to IRQ\n");
+			else
+				set_bit(i, ipi_is_nmi);
+		}
+
+		if (err) {
 			err = request_percpu_irq(ipi_base + i, ipi_handler,
 						 "IPI", &cpu_number);
-			WARN(err, "Could not request IPI %d as IRQ, err=%d\n",
+			WARN(err, "Could not request IPI %d, err=%d\n",
 			     i, err);
 		}