Message ID: 20230603193439.502645149@linutronix.de
From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org, Ashok Raj <ashok.raj@linux.intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, Tony Luck <tony.luck@intel.com>, Arjan van de Veen <arjan@linux.intel.com>, Peter Zijlstra <peterz@infradead.org>, Eric Biederman <ebiederm@xmission.com>
Subject: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
Date: Sat, 3 Jun 2023 22:06:54 +0200 (CEST)
Series: Cure kexec() vs. mwait_play_dead() troubles
Message
Thomas Gleixner
June 3, 2023, 8:06 p.m. UTC
Hi!

Ashok observed triple faults when executing kexec() on a kernel which has
'nosmt' on the kernel command line and HT enabled in the BIOS.

'nosmt' brings up the HT siblings to the point where they initialized the
CPU and then rolls the bringup back, which parks them in mwait_play_dead().
The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
MCE will immediately shut down the machine.

Some detective work revealed that:

 1) The kexec kernel can overwrite text, pagetables, stack and data of the
    previous kernel.

 2) If the kexec kernel writes to the memory which is monitored by an
    "offline" CPU, that CPU resumes execution. That's obviously doomed
    when the kexec kernel overwrote text, pagetables, data or stack.

While on my test machine the first kexec() after reset always "worked", the
second one reliably ended up in a triple fault.

The following series cures this by:

 1) Bringing offline CPUs which are stuck in mwait_play_dead() out of
    mwait by writing to the monitored cacheline

 2) Letting the woken-up CPUs check the written control word and drop into
    a HLT loop if the control word requests so.

    This is only half safe because HLT can resume execution due to NMI,
    SMI and MCE. Unfortunately there is no real safe mechanism to "park" a
    CPU reliably, but there is at least one which prevents the NMI and SMI
    cause: INIT.

 3) If the system uses the regular INIT/STARTUP sequence to wake up
    secondary CPUs, then "park" all CPUs including the "offline" ones by
    sending them INIT IPIs.

The INIT IPI brings the CPU into a wait-for-wakeup state which is not
affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
problem comes back.

But that's not really any different from a CPU sitting in the HLT loop on
the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
the CPU tries to handle the MCE on overwritten text, pagetables etc.

So parking them via INIT is not completely solving the problem, but it
takes at least NMI and SMI out of the picture.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec

Thanks,

	tglx
---
 include/asm/smp.h |   4 +
 kernel/smp.c      |  62 +++++++++++++---------
 kernel/smpboot.c  | 151 ++++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 156 insertions(+), 61 deletions(-)
Comments
On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> Hi!
>
> Ashok observed triple faults when executing kexec() on a kernel which has
> 'nosmt' on the kernel commandline and HT enabled in the BIOS.
>
> 'nosmt' brings up the HT siblings to the point where they initiliazed the
> CPU and then rolls the bringup back which parks them in mwait_play_dead().
> The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
> MCE will immediately shut down the machine.

...

> This is only half safe because HLT can resume execution due to NMI, SMI and
> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,

On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything
except single-step #DB (lol) and RESET. #MC handling is
implementation-dependent and *might* cause shutdown, but at least there's a
chance it will work. And presumably modern CPUs do pend the #MC until GIF=1.

> but there is at least one which prevents the NMI and SMI cause: INIT.
>
> 3) If the system uses the regular INIT/STARTUP sequence to wake up
>    secondary CPUS, then "park" all CPUs including the "offline" ones
>    by sending them INIT IPIs.
>
> The INIT IPI brings the CPU into a wait for wakeup state which is not
> affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
> problem comes back.
>
> But that's not really any different from a CPU sitting in the HLT loop on
> the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
> the CPU tries to handle the MCE on overwritten text, pagetables etc.
>
> So parking them via INIT is not completely solving the problem, but it
> takes at least NMI and SMI out of the picture.

Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
indefinitely potentially cause problems too?

Why not carve out a page that's hidden across kexec() to hold whatever
code+data is needed to safely execute a HLT loop indefinitely? E.g. doesn't
the original kernel provide the e820 tables for the post-kexec() kernel? To
avoid OOM after many kexec(), reserving a page could be done iff the
current kernel wasn't itself kexec()'d.
On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> This is only half safe because HLT can resume execution due to NMI, SMI and
>> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>
> On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything
> except single-step #DB (lol) and RESET. #MC handling is
> implementation-dependent and *might* cause shutdown, but at least there's a
> chance it will work. And presumably modern CPUs do pend the #MC until GIF=1.

Abusing SVME for that is definitely in the realm of creative bonus points,
but not necessarily a general purpose solution.

>> So parking them via INIT is not completely solving the problem, but it
>> takes at least NMI and SMI out of the picture.
>
> Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
> indefinitely potentially cause problems too?

Not that I'm aware of. If so then this would be a hideous firmware bug as
firmware must be aware of CPUs which hang around in INIT independent of
this.

> Why not carve out a page that's hidden across kexec() to hold whatever
> code+data is needed to safely execute a HLT loop indefinitely?

See below.

> E.g. doesn't the original kernel provide the e820 tables for the
> post-kexec() kernel?

Only for crash kernels if I'm not missing something.

Making this work for regular kexec() including this:

> To avoid OOM after many kexec(), reserving a page could be done iff
> the current kernel wasn't itself kexec()'d.

would be possible and I thought about it, but that needs a complete new
design of "offline", "shutdown offline" and a non-trivial amount of
backwards compatibility magic because you can't assume that the kexec()
kernel version is greater or equal to the current one. kexec() is supposed
to work both ways, downgrading and upgrading. IOW, that ship sailed long
ago.

Thanks,

	tglx
On Tue, Jun 06, 2023, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > modern CPUs do pend the #MC until GIF=1.
>
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.

Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
>
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c,
e.g. SmmWaitForApArrival(). I've never dug deeply into how EDKII uses SMM,
what its timeouts are, etc., I just remember coming across that code when
poking around EDKII for other stuff.

> > Why not carve out a page that's hidden across kexec() to hold whatever code+data
> > is needed to safely execute a HLT loop indefinitely?
>
> See below.
>
> > E.g. doesn't the original kernel provide the e820 tables for the
> > post-kexec() kernel?
>
> Only for crash kernels if I'm not missing something.

Ah, drat.

> Making this work for regular kexec() including this:
>
> > To avoid OOM after many kexec(), reserving a page could be done iff
> > the current kernel wasn't itself kexec()'d.
>
> would be possible and I thought about it, but that needs a complete new
> design of "offline", "shutdown offline" and a non-trivial amount of
> backwards compatibility magic because you can't assume that the kexec()
> kernel version is greater or equal to the current one. kexec() is
> supposed to work both ways, downgrading and upgrading. IOW, that ship
> sailed long ago.

Right, but doesn't gaining "full" protection require ruling out
unenlightened downgrades? E.g. if someone downgrades to an old kernel,
doesn't hide the "offline" CPUs from the kexec() kernel, and boots the old
kernel with -nosmt or whatever, then that old kernel will do the naive
MWAIT or unprotected HLT and it's hosed again.

If we're relying on the admin to hide the offline CPUs, could we usurp an
existing kernel param to hide a small chunk of memory instead?
On Mon, Jun 05 2023 at 16:08, Sean Christopherson wrote:
> On Tue, Jun 06, 2023, Thomas Gleixner wrote:
>> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
>> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> >> This is only half safe because HLT can resume execution due to NMI, SMI and
>> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>> >
>> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
>> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
>> > *might* cause shutdown, but at least there's a chance it will work. And presumably
>> > modern CPUs do pend the #MC until GIF=1.
>>
>> Abusing SVME for that is definitely in the realm of creative bonus
>> points, but not necessarily a general purpose solution.
>
> Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

I feared that :)

>> >> So parking them via INIT is not completely solving the problem, but it
>> >> takes at least NMI and SMI out of the picture.
>> >
>> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
>> > potentially cause problems too?
>>
>> Not that I'm aware of. If so then this would be a hideous firmware bug
>> as firmware must be aware of CPUs which hang around in INIT independent
>> of this.
>
> I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c, e.g.
> SmmWaitForApArrival(). I've never dug deeply into how EDKII uses SMM, what its
> timeouts are, etc., I just remember coming across that code when poking around
> EDKII for other stuff.

There is a comment:

  Note the SMI Handlers must ALWAYS take into account the cases that not
  all APs are available in an SMI run.

Also not all SMIs required global synchronization. But it's all an
impenetrable mess...

>> Making this work for regular kexec() including this:
>>
>> > To avoid OOM after many kexec(), reserving a page could be done iff
>> > the current kernel wasn't itself kexec()'d.
>>
>> would be possible and I thought about it, but that needs a complete new
>> design of "offline", "shutdown offline" and a non-trivial amount of
>> backwards compatibility magic because you can't assume that the kexec()
>> kernel version is greater or equal to the current one. kexec() is
>> supposed to work both ways, downgrading and upgrading. IOW, that ship
>> sailed long ago.
>
> Right, but doesn't gaining "full" protection require ruling out unenlightened
> downgrades? E.g. if someone downgrades to an old kernel, doesn't hide the "offline"
> CPUs from the kexec() kernel, and boots the old kernel with -nosmt or whatever,
> then that old kernel will do the naive MWAIT or unprotected HLT and
> it's hosed again.

Of course.

> If we're relying on the admin to hide the offline CPUs, could we usurp
> an existing kernel param to hide a small chunk of memory instead?

The only "safe" place is below 1M I think. Not sure whether we have some
existing command line option to "hide" a range there. Neither am I sure
that this would be always the same range.

More questions than answers :)

Thanks

	tglx
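For what it's worth, the closest existing knob to "hide a small chunk of memory" seems to be the memmap= boot parameter, which can mark a range as reserved. This is purely a hypothetical illustration (the address is invented, and whether any fixed range below 1M is safe or stable across machines is exactly the open question above):

```
# Hypothetical: reserve one 4 KiB page at an invented low address.
# memmap=nn$ss marks the range reserved; note the '$' usually needs
# escaping in bootloader configs (e.g. '\$' in GRUB).
memmap=4K$0x96000
```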
On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > modern CPUs do pend the #MC until GIF=1.
>
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.
>
> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
>
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

SMM does do the rendezvous of all CPUs, but also has a way to detect the
blocked ones, in WFS via some package scoped ubox register. So it knows to
skip those. I can find this in internal sources, but they aren't available
in the edk2 open reference code. They happen to be documented only in the
BWG, which isn't available freely.

I believe it's behind the GetSmmDelayedBlockedDisabledCount() ->
SmmCpuFeaturesGetSmmRegister()
On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > >
> > > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > > modern CPUs do pend the #MC until GIF=1.
> >
> > Abusing SVME for that is definitely in the realm of creative bonus
> > points, but not necessarily a general purpose solution.
> >
> > >> So parking them via INIT is not completely solving the problem, but it
> > >> takes at least NMI and SMI out of the picture.
> > >
> > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > potentially cause problems too?
> >
> > Not that I'm aware of. If so then this would be a hideous firmware bug
> > as firmware must be aware of CPUs which hang around in INIT independent
> > of this.
>
> SMM does do the rendezvous of all CPUs, but also has a way to detect the
> blocked ones, in WFS via some package scoped ubox register. So it knows to
> skip those. I can find this in internal sources, but they aren't available
> in the edk2 open reference code. They happen to be documented only in the
> BWG, which isn't available freely.

Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not
on bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for
secure boot, so taking SMIs after booting and putting CPUs back into WFS
should be ok-ish.

Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> On Wed, Jun 07, 2023, Ashok Raj wrote:
> > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > > >
> > > > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > > > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > > > modern CPUs do pend the #MC until GIF=1.
> > >
> > > Abusing SVME for that is definitely in the realm of creative bonus
> > > points, but not necessarily a general purpose solution.
> > >
> > > >> So parking them via INIT is not completely solving the problem, but it
> > > >> takes at least NMI and SMI out of the picture.
> > > >
> > > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > > potentially cause problems too?
> > >
> > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > as firmware must be aware of CPUs which hang around in INIT independent
> > > of this.
> >
> > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > skip those. I can find this in internal sources, but they aren't available
> > in the edk2 open reference code. They happen to be documented only in the
> > BWG, which isn't available freely.
>
> Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure

Never knew SMM had any role in VMs.. I thought SMM was always native.
Who owns this SMM for VMs.. from the VirtualBIOS?

> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
>
> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.

I always seem to turn off secureboot installing Ubuntu :-).. I'll try to
find someone who might know, especially about doing SMM in VMs.

Can you tell what needs to be validated in the guest? Would doing kexec
inside the guest with the new patch set be sufficient?

Or you mean in guest, do a kexec and launch secure boot of new kernel?

If there is a specific test you want done, let me know.
On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> > On Wed, Jun 07, 2023, Ashok Raj wrote:
> > > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > > >> So parking them via INIT is not completely solving the problem, but it
> > > > >> takes at least NMI and SMI out of the picture.
> > > > >
> > > > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > > > potentially cause problems too?
> > > >
> > > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > > as firmware must be aware of CPUs which hang around in INIT independent
> > > > of this.
> > >
> > > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > > skip those. I can find this in internal sources, but they aren't available
> > > in the edk2 open reference code. They happen to be documented only in the
> > > BWG, which isn't available freely.
> >
> > Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> > bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure
>
> Never knew SMM had any role in VM's.. I thought SMM was always native.
>
> Who owns this SMM for VM's.. from the VirtualBIOS?

Yes?

> > boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
> >
> > Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
>
> I always seem to turn off secureboot installing Ubuntu :-)

Yeah, I don't utilize it in any of my VMs either.

> I'll try to find someone who might know especially doing SMM In VM.
>
> Can you tell what needs to be validated in the guest? Would doing kexec
> inside the guest with the new patch set be sufficient?
>
> Or you mean in guest, do a kexec and launch secure boot of new kernel?

Yes? I don't actually have hands on experience with such a setup, I'm
familiar with it purely through bug reports, e.g. this one

https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com

> If there is a specific test you want done, let me know.

Smoke testing is all I was thinking. I wouldn't put too much effort into
trying to make sure this all works. Like I said earlier, nice to have, but
certainly not necessary.
On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
>
> Yes? I don't actually have hands on experience with such a setup, I'm familiar
> with it purely through bug reports, e.g. this one
>
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
>
> > If there is a specific test you want done, let me know.
>
> Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> to make sure this all works. Like I said earlier, nice to have, but certainly not
> necessary.

Thanks for the link, Sean. I'll do some followup.
On 6/7/23 19:33, Sean Christopherson wrote:
>>>> Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
>>>> indefinitely potentially cause problems too?
>>>
>>> Not that I'm aware of. If so then this would be a hideous firmware bug
>>> as firmware must be aware of CPUs which hang around in INIT independent
>>> of this.
>>
>> SMM does do the rendezvous of all CPUs, but also has a way to detect the
>> blocked ones, in WFS via some package scoped ubox register. So it knows to
>> skip those. I can find this in internal sources, but they aren't available
>> in the edk2 open reference code. They happen to be documented only in the
>> BWG, which isn't available freely.
>
> Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure
> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.

VMs do not have things like periodic or watchdog SMIs, they only enter SMM
in response to IPIs or writes to 0xB1. The writes to 0xB1 in turn should
only happen from UEFI runtime services related to the UEFI variable store.

Another possibility could be ACPI bytecode from either DSDT or APEI; not
implemented yet and very unlikely to happen in the future, but not
impossible either.

Either way they should not happen before the kexec-ed kernel has brought up
all CPUs.

Paolo

> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
>
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
>
> > If there is a specific test you want done, let me know.
>
> Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> to make sure this all works. Like I said earlier, nice to have, but certainly not
> necessary.

+ Vijay who was helping with testing this inside the VM.
+ Paolo, Laszlo

I haven't found the exact method to test with secure boot/trusted boot yet.
But here is what we were able to test thus far.

Vijay was able to get OVMF recompiled with SMM included.

Thanks to Laszlo for pointing me in the right direction. And Paolo for
helping with some basic questions.

https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt

Surprisingly SMM emulation is sadly damn good :-)

Recipe is to generate SMI by writing to port 0xb2.

- On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
  goes up by 1 on all logical CPUs.
- Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
- Do another port 0xb2, we don't see any hangs
- Bring up SMT by echo on > control, and we can see even the offline CPUs
  got the SMI as indicated by MSR 0x34. (Which is as expected)

On guest, the only difference was when we turn on HT again, waking the CPUs
from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
perfectly fine) All I was looking for was "no hang". And a normal kexec
with newly updated code works well inside a guest.

Would this qualify for the smoke test pass? I'll continue to look for a
secure boot install if this doesn't close it, just haven't landed at the
right spot yet.
On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
> >
> > https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
> >
> > > If there is a specific test you want done, let me know.
> >
> > Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> > to make sure this all works. Like I said earlier, nice to have, but certainly not
> > necessary.
>
> + Vijay who was helping with testing this inside the VM.
> + Paolo, Laszlo
>
> I haven't found the exact method to test with secure boot/trusted boot yet.
> But here is what we were able to test thus far.
>
> Vijay was able to get OVMF recompiled with SMM included.
>
> Thanks to Laszlo for pointing me in the right direction. And Paolo for
> helping with some basic questions.
>
> https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt
>
> Surprisingly SMM emulation is sadly damn good :-)
>
> Recipe is to generate SMI by writing to port 0xb2.
>
> - On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
>   goes up by 1 on all logical CPUs.
> - Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
> - Do another port 0xb2, we don't see any hangs
> - Bring up SMT by echo on > control, and we can see even the offline CPUs
>   got the SMI as indicated by MSR 0x34. (Which is as expected)
>
> On guest, the only difference was when we turn on HT again, waking the CPUs
> from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
> perfectly fine) All I was looking for was "no hang". And a normal kexec
> with newly updated code works well inside a guest.
>
> Would this qualify for the smoke test pass? I'll continue to look for a
> secure boot install if this doesn't close it, just haven't landed at the
> right spot yet.

Good enough for me, thanks much!
On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > Would this qualify for the smoke test pass? I'll continue to look for a
> > secure boot install if this doesn't close it, just haven't landed at the
> > right spot yet.
>
> Good enough for me, thanks much!

Thanks a ton Sean.. if anything you have now scared me there is life for
SMM after all even in a guest :-)... Completely took me by surprise!

Cheers,
Ashok
On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > > Would this qualify for the smoke test pass? I'll continue to look for a
> > > secure boot install if this doesn't close it, just haven't landed at the
> > > right spot yet.
> >
> > Good enough for me, thanks much!
>
> Thanks a ton Sean.. if anything you have now scared me there is life for
> SMM afterall even in a guest :-)

LOL, you and me both. Thankfully GCE's implementation of Secure Boot for
VMs doesn't utilize SMM, so to a large extent I can close my eyes and plug
my ears :-)