Message ID: 20230603193439.502645149@linutronix.de
From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org, Ashok Raj <ashok.raj@linux.intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, Tony Luck <tony.luck@intel.com>, Arjan van de Veen <arjan@linux.intel.com>, Peter Zijlstra <peterz@infradead.org>, Eric Biederman <ebiederm@xmission.com>
Subject: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles
Date: Sat, 3 Jun 2023 22:06:54 +0200 (CEST)
Series: Cure kexec() vs. mwait_play_dead() troubles
Message
Thomas Gleixner
June 3, 2023, 8:06 p.m. UTC
Hi!

Ashok observed triple faults when executing kexec() on a kernel which has
'nosmt' on the kernel command line and HT enabled in the BIOS.

'nosmt' brings up the HT siblings to the point where they initialized the
CPU and then rolls the bringup back, which parks them in mwait_play_dead().
The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
MCE will immediately shut down the machine.

Some detective work revealed that:

 1) The kexec kernel can overwrite text, pagetables, stack and data of the
    previous kernel.

 2) If the kexec kernel writes to the memory which is monitored by an
    "offline" CPU, that CPU resumes execution. That's obviously doomed
    when the kexec kernel overwrote text, pagetables, data or stack.

While on my test machine the first kexec() after reset always "worked", the
second one reliably ended up in a triple fault.

The following series cures this by:

 1) Bringing offline CPUs which are stuck in mwait_play_dead() out of
    mwait by writing to the monitored cacheline

 2) Letting the woken-up CPUs check the written control word and drop into
    a HLT loop if the control word requests so.

    This is only half safe because HLT can resume execution due to NMI,
    SMI and MCE. Unfortunately there is no real safe mechanism to "park" a
    CPU reliably, but there is at least one which prevents the NMI and SMI
    cause: INIT.

 3) If the system uses the regular INIT/STARTUP sequence to wake up
    secondary CPUs, then "park" all CPUs including the "offline" ones by
    sending them INIT IPIs.

The INIT IPI brings the CPU into a wait-for-wakeup state which is not
affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
problem comes back.

But that's not really any different from a CPU sitting in the HLT loop on
the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
the CPU tries to handle the MCE on overwritten text, pagetables etc.

So parking them via INIT is not completely solving the problem, but it
takes at least NMI and SMI out of the picture.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec

Thanks,

	tglx
---
 include/asm/smp.h |   4 +
 kernel/smp.c      |  62 +++++++++++++---------
 kernel/smpboot.c  | 151 ++++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 156 insertions(+), 61 deletions(-)
Comments
On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> Hi!
>
> Ashok observed triple faults when executing kexec() on a kernel which has
> 'nosmt' on the kernel commandline and HT enabled in the BIOS.
>
> 'nosmt' brings up the HT siblings to the point where they initiliazed the
> CPU and then rolls the bringup back which parks them in mwait_play_dead().
> The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
> MCE will immediately shut down the machine.

...

> This is only half safe because HLT can resume execution due to NMI, SMI and
> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,

On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything
except single-step #DB (lol) and RESET. #MC handling is
implementation-dependent and *might* cause shutdown, but at least there's a
chance it will work. And presumably modern CPUs do pend the #MC until GIF=1.

> but there is at least one which prevents the NMI and SMI cause: INIT.
>
> 3) If the system uses the regular INIT/STARTUP sequence to wake up
>    secondary CPUS, then "park" all CPUs including the "offline" ones
>    by sending them INIT IPIs.
>
> The INIT IPI brings the CPU into a wait for wakeup state which is not
> affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
> problem comes back.
>
> But that's not really any different from a CPU sitting in the HLT loop on
> the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
> the CPU tries to handle the MCE on overwritten text, pagetables etc.
>
> So parking them via INIT is not completely solving the problem, but it
> takes at least NMI and SMI out of the picture.

Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
indefinitely potentially cause problems too?

Why not carve out a page that's hidden across kexec() to hold whatever
code+data is needed to safely execute a HLT loop indefinitely? E.g. doesn't
the original kernel provide the e820 tables for the post-kexec() kernel? To
avoid OOM after many kexec(), reserving a page could be done iff the
current kernel wasn't itself kexec()'d.
On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> This is only half safe because HLT can resume execution due to NMI, SMI and
>> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>
> On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything
> except single-step #DB (lol) and RESET. #MC handling is
> implementation-dependent and *might* cause shutdown, but at least there's a
> chance it will work. And presumably modern CPUs do pend the #MC until GIF=1.

Abusing SVME for that is definitely in the realm of creative bonus points,
but not necessarily a general purpose solution.

>> So parking them via INIT is not completely solving the problem, but it
>> takes at least NMI and SMI out of the picture.
>
> Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
> indefinitely potentially cause problems too?

Not that I'm aware of. If so then this would be a hideous firmware bug as
firmware must be aware of CPUs which hang around in INIT independent of
this.

> Why not carve out a page that's hidden across kexec() to hold whatever
> code+data is needed to safely execute a HLT loop indefinitely?

See below.

> E.g. doesn't the original kernel provide the e820 tables for the
> post-kexec() kernel?

Only for crash kernels if I'm not missing something.

Making this work for regular kexec() including this:

> To avoid OOM after many kexec(), reserving a page could be done iff
> the current kernel wasn't itself kexec()'d.

would be possible and I thought about it, but that needs a complete new
design of "offline", "shutdown offline" and a non-trivial amount of
backwards compatibility magic because you can't assume that the kexec()
kernel version is greater or equal to the current one. kexec() is supposed
to work both ways, downgrading and upgrading. IOW, that ship sailed long
ago.

Thanks,

	tglx
On Tue, Jun 06, 2023, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > modern CPUs do pend the #MC until GIF=1.
>
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.

Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
>
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c,
e.g. SmmWaitForApArrival(). I've never dug deeply into how EDKII uses SMM,
what its timeouts are, etc., I just remember coming across that code when
poking around EDKII for other stuff.

> > Why not carve out a page that's hidden across kexec() to hold whatever code+data
> > is needed to safely execute a HLT loop indefinitely?
>
> See below.
>
> > E.g. doesn't the original kernel provide the e820 tables for the
> > post-kexec() kernel?
>
> Only for crash kernels if I'm not missing something.

Ah, drat.

> Making this work for regular kexec() including this:
>
> > To avoid OOM after many kexec(), reserving a page could be done iff
> > the current kernel wasn't itself kexec()'d.
>
> would be possible and I thought about it, but that needs a complete new
> design of "offline", "shutdown offline" and a non-trivial amount of
> backwards compatibility magic because you can't assume that the kexec()
> kernel version is greater or equal to the current one. kexec() is
> supposed to work both ways, downgrading and upgrading. IOW, that ship
> sailed long ago.

Right, but doesn't gaining "full" protection require ruling out
unenlightened downgrades? E.g. if someone downgrades to an old kernel,
doesn't hide the "offline" CPUs from the kexec() kernel, and boots the old
kernel with -nosmt or whatever, then that old kernel will do the naive
MWAIT or unprotected HLT and it's hosed again.

If we're relying on the admin to hide the offline CPUs, could we usurp an
existing kernel param to hide a small chunk of memory instead?
On Mon, Jun 05 2023 at 16:08, Sean Christopherson wrote:
> On Tue, Jun 06, 2023, Thomas Gleixner wrote:
>> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
>> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
>> >> This is only half safe because HLT can resume execution due to NMI, SMI and
>> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
>> >
>> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
>> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
>> > *might* cause shutdown, but at least there's a chance it will work. And presumably
>> > modern CPUs do pend the #MC until GIF=1.
>>
>> Abusing SVME for that is definitely in the realm of creative bonus
>> points, but not necessarily a general purpose solution.
>
> Heh, my follow-up ideas for Intel are to abuse XuCode or SEAM ;-)

I feared that :)

>> >> So parking them via INIT is not completely solving the problem, but it
>> >> takes at least NMI and SMI out of the picture.
>> >
>> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
>> > potentially cause problems too?
>>
>> Not that I'm aware of. If so then this would be a hideous firmware bug
>> as firmware must be aware of CPUs which hang around in INIT independent
>> of this.
>
> I was thinking of the EDKII code in UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c, e.g.
> SmmWaitForApArrival(). I've never dug deeply into how EDKII uses SMM, what its
> timeouts are, etc., I just remember coming across that code when poking around
> EDKII for other stuff.

There is a comment:

  Note the SMI Handlers must ALWAYS take into account the cases that not
  all APs are available in an SMI run.

Also not all SMIs required global synchronization. But it's all an
impenetrable mess...

>> Making this work for regular kexec() including this:
>>
>> > To avoid OOM after many kexec(), reserving a page could be done iff
>> > the current kernel wasn't itself kexec()'d.
>>
>> would be possible and I thought about it, but that needs a complete new
>> design of "offline", "shutdown offline" and a non-trivial amount of
>> backwards compatibility magic because you can't assume that the kexec()
>> kernel version is greater or equal to the current one. kexec() is
>> supposed to work both ways, downgrading and upgrading. IOW, that ship
>> sailed long ago.
>
> Right, but doesn't gaining "full" protection require ruling out unenlightened
> downgrades? E.g. if someone downgrades to an old kernel, doesn't hide the "offline"
> CPUs from the kexec() kernel, and boots the old kernel with -nosmt or whatever,
> then that old kernel will do the naive MWAIT or unprotected HLT and
> it's hosed again.

Of course.

> If we're relying on the admin to hide the offline CPUs, could we usurp
> an existing kernel param to hide a small chunk of memory instead?

The only "safe" place is below 1M I think. Not sure whether we have some
existing command line option to "hide" a range there. Neither am I sure
that this would be always the same range.

More questions than answers :)

Thanks

	tglx
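For what it's worth, the closest existing knob to "hide a small chunk of memory" seems to be the memmap= boot parameter, which can mark a range as reserved. This is purely a hypothetical illustration (the address is invented, and whether any fixed range below 1M is safe or stable across machines is exactly the open question above):

```
# Hypothetical: reserve one 4 KiB page at an invented low address.
# memmap=nn$ss marks the range reserved; note the '$' usually needs
# escaping in bootloader configs (e.g. '\$' in GRUB).
memmap=4K$0x96000
```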
On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> >> This is only half safe because HLT can resume execution due to NMI, SMI and
> >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> >
> > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > modern CPUs do pend the #MC until GIF=1.
>
> Abusing SVME for that is definitely in the realm of creative bonus
> points, but not necessarily a general purpose solution.
>
> >> So parking them via INIT is not completely solving the problem, but it
> >> takes at least NMI and SMI out of the picture.
> >
> > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > potentially cause problems too?
>
> Not that I'm aware of. If so then this would be a hideous firmware bug
> as firmware must be aware of CPUs which hang around in INIT independent
> of this.

SMM does do the rendezvous of all CPUs, but also has a way to detect the
blocked ones, in WFS via some package scoped ubox register. So it knows to
skip those. I can find this in internal sources, but they aren't available
in the edk2 open reference code. They happen to be documented only in the
BWG, which isn't available freely.

I believe it's behind the GetSmmDelayedBlockedDisabledCount() ->
SmmCpuFeaturesGetSmmRegister()
On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > >
> > > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > > modern CPUs do pend the #MC until GIF=1.
> >
> > Abusing SVME for that is definitely in the realm of creative bonus
> > points, but not necessarily a general purpose solution.
> >
> > >> So parking them via INIT is not completely solving the problem, but it
> > >> takes at least NMI and SMI out of the picture.
> > >
> > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > potentially cause problems too?
> >
> > Not that I'm aware of. If so then this would be a hideous firmware bug
> > as firmware must be aware of CPUs which hang around in INIT independent
> > of this.
>
> SMM does do the rendezvous of all CPUs, but also has a way to detect the
> blocked ones, in WFS via some package scoped ubox register. So it knows to
> skip those. I can find this in internal sources, but they aren't available
> in the edk2 open reference code. They happen to be documented only in the
> BWG, which isn't available freely.

Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not
on bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for
secure boot, so taking SMIs after booting and putting CPUs back into WFS
should be ok-ish.

Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> On Wed, Jun 07, 2023, Ashok Raj wrote:
> > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > On Mon, Jun 05 2023 at 10:41, Sean Christopherson wrote:
> > > > On Sat, Jun 03, 2023, Thomas Gleixner wrote:
> > > >> This is only half safe because HLT can resume execution due to NMI, SMI and
> > > >> MCE. Unfortunately there is no real safe mechanism to "park" a CPU reliably,
> > > >
> > > > On Intel. On AMD, enabling EFER.SVME and doing CLGI will block everything except
> > > > single-step #DB (lol) and RESET. #MC handling is implementation-dependent and
> > > > *might* cause shutdown, but at least there's a chance it will work. And presumably
> > > > modern CPUs do pend the #MC until GIF=1.
> > >
> > > Abusing SVME for that is definitely in the realm of creative bonus
> > > points, but not necessarily a general purpose solution.
> > >
> > > >> So parking them via INIT is not completely solving the problem, but it
> > > >> takes at least NMI and SMI out of the picture.
> > > >
> > > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > > potentially cause problems too?
> > >
> > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > as firmware must be aware of CPUs which hang around in INIT independent
> > > of this.
> >
> > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > skip those. I can find this in internal sources, but they aren't available
> > in the edk2 open reference code. They happen to be documented only in the
> > BWG, which isn't available freely.
>
> Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure

Never knew SMM had any role in VMs.. I thought SMM was always native.
Who owns this SMM for VMs.. from the VirtualBIOS?

> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
>
> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.

I always seem to turn off secureboot installing Ubuntu :-).. I'll try to
find someone who might know, especially about doing SMM in VMs.

Can you tell what needs to be validated in the guest? Would doing kexec
inside the guest with the new patch set be sufficient?

Or you mean in guest, do a kexec and launch secure boot of new kernel?

If there is a specific test you want done, let me know.
On Wed, Jun 07, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 10:33:35AM -0700, Sean Christopherson wrote:
> > On Wed, Jun 07, 2023, Ashok Raj wrote:
> > > On Tue, Jun 06, 2023 at 12:41:43AM +0200, Thomas Gleixner wrote:
> > > > >> So parking them via INIT is not completely solving the problem, but it
> > > > >> takes at least NMI and SMI out of the picture.
> > > > >
> > > > > Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs indefinitely
> > > > > potentially cause problems too?
> > > >
> > > > Not that I'm aware of. If so then this would be a hideous firmware bug
> > > > as firmware must be aware of CPUs which hang around in INIT independent
> > > > of this.
> > >
> > > SMM does do the rendezvous of all CPUs, but also has a way to detect the
> > > blocked ones, in WFS via some package scoped ubox register. So it knows to
> > > skip those. I can find this in internal sources, but they aren't available
> > > in the edk2 open reference code. They happen to be documented only in the
> > > BWG, which isn't available freely.
> >
> > Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> > bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure
>
> Never knew SMM had any role in VM's.. I thought SMM was always native.
>
> Who owns this SMM for VM's.. from the VirtualBIOS?

Yes?

> > boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.
> >
> > Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
>
> I always seem to turn off secureboot installing Ubuntu :-)

Yeah, I don't utilize it in any of my VMs either.

> I'll try to find someone who might know especially doing SMM In VM.
>
> Can you tell what needs to be validated in the guest? Would doing kexec
> inside the guest with the new patch set be sufficient?
>
> Or you mean in guest, do a kexec and launch secure boot of new kernel?

Yes? I don't actually have hands on experience with such a setup, I'm
familiar with it purely through bug reports, e.g. this one

https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com

> If there is a specific test you want done, let me know.

Smoke testing is all I was thinking. I wouldn't put too much effort into
trying to make sure this all works. Like I said earlier, nice to have, but
certainly not necessary.
On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
>
> Yes? I don't actually have hands on experience with such a setup, I'm familiar
> with it purely through bug reports, e.g. this one
>
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
>
> > If there is a specific test you want done, let me know.
>
> Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> to make sure this all works. Like I said earlier, nice to have, but certainly not
> necessary.

Thanks for the link, Sean. I'll do some followup.
On 6/7/23 19:33, Sean Christopherson wrote:
>>>> Don't most SMM handlers rendezvous all CPUs? I.e. won't blocking SMIs
>>>> indefinitely potentially cause problems too?
>>>
>>> Not that I'm aware of. If so then this would be a hideous firmware bug
>>> as firmware must be aware of CPUs which hang around in INIT independent
>>> of this.
>>
>> SMM does do the rendezvous of all CPUs, but also has a way to detect the
>> blocked ones, in WFS via some package scoped ubox register. So it knows to
>> skip those. I can find this in internal sources, but they aren't available
>> in the edk2 open reference code. They happen to be documented only in the
>> BWG, which isn't available freely.
>
> Ah, so putting CPUs into WFS shouldn't result in odd delays. At least not on
> bare metal. Hmm, and AFAIK the primary use case for SMM in VMs is for secure
> boot, so taking SMIs after booting and putting CPUs back into WFS should be ok-ish.

VMs do not have things like periodic or watchdog SMIs, they only enter SMM
in response to IPIs or writes to 0xB1. The writes to 0xB1 in turn should
only happen from UEFI runtime services related to the UEFI variable store.

Another possibility could be ACPI bytecode from either DSDT or APEI; not
implemented yet and very unlikely to happen in the future, but not
impossible either.

Either way they should not happen before the kexec-ed kernel has brought up
all CPUs.

Paolo

> Finding a victim to test this in a QEMU VM w/ Secure Boot would be nice to have.
On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
>
> https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
>
> > If there is a specific test you want done, let me know.
>
> Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> to make sure this all works. Like I said earlier, nice to have, but certainly not
> necessary.

+ Vijay who was helping with testing this inside the VM.
+ Paolo, Laszlo

I haven't found the exact method to test with secure boot/trusted boot yet.
But here is what we were able to test thus far.

Vijay was able to get OVMF recompiled with SMM included.

Thanks to Laszlo for pointing me in the right direction. And Paolo for
helping with some basic questions.

https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt

Surprisingly SMM emulation is sadly damn good :-)

Recipe is to generate SMI by writing to port 0xb2.

- On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
  goes up by 1 on all logical CPUs.
- Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
- Do another port 0xb2, we don't see any hangs
- Bring up SMT by echo on > control, and we can see even the offline CPUs
  got the SMI as indicated by MSR 0x34. (Which is as expected)

On guest, the only difference was when we turn on HT again, waking the CPUs
from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
perfectly fine) All I was looking for was "no hang". And a normal kexec
with newly updated code works well inside a guest.

Would this qualify for the smoke test pass? I'll continue to look for a
secure boot install if this doesn't close it, just haven't landed at the
right spot yet.
On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Wed, Jun 07, 2023 at 08:46:22PM -0700, Sean Christopherson wrote:
> >
> > https://lore.kernel.org/all/BYAPR12MB301441A16CE6CFFE17147888A0A09@BYAPR12MB3014.namprd12.prod.outlook.com
> >
> > > If there is a specific test you want done, let me know.
> >
> > Smoke testing is all I was thinking. I wouldn't put too much effort into trying
> > to make sure this all works. Like I said earlier, nice to have, but certainly not
> > necessary.
>
> + Vijay who was helping with testing this inside the VM.
> + Paolo, Laszlo
>
> I haven't found the exact method to test with secure boot/trusted boot yet.
> But here is what we were able to test thus far.
>
> Vijay was able to get OVMF recompiled with SMM included.
>
> Thanks to Laszlo for pointing me in the right direction. And Paolo for
> helping with some basic questions.
>
> https://github.com/tianocore/tianocore.github.io/wiki/Testing-SMM-with-QEMU,-KVM-and-libvirt
>
> Surprisingly SMM emulation is sadly damn good :-)
>
> Recipe is to generate SMI by writing to port 0xb2.
>
> - On native, this does generate a broadcast SMI, the SMI_COUNT MSR 0x34
>   goes up by 1 on all logical CPUs.
> - Turn off SMT by #echo off > /sys/devices/system/cpu/smt/control
> - Do another port 0xb2, we don't see any hangs
> - Bring up SMT by echo on > control, and we can see even the offline CPUs
>   got the SMI as indicated by MSR 0x34. (Which is as expected)
>
> On guest, the only difference was when we turn on HT again, waking the CPUs
> from INIT, SMI_COUNT has zeroed as opposed to the native. (Which is
> perfectly fine) All I was looking for was "no hang". And a normal kexec
> with newly updated code works well inside a guest.
>
> Would this qualify for the smoke test pass? I'll continue to look for a
> secure boot install if this doesn't close it, just haven't landed at the
> right spot yet.

Good enough for me, thanks much!
On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > Would this qualify for the smoke test pass? I'll continue to look for a
> > secure boot install if this doesn't close it, just haven't landed at the
> > right spot yet.
>
> Good enough for me, thanks much!

Thanks a ton Sean.. if anything you have now scared me there is life for
SMM after all even in a guest :-)... Completely took me by surprise!

Cheers,
Ashok
On Fri, Jun 16, 2023, Ashok Raj wrote:
> On Fri, Jun 16, 2023 at 12:00:13PM -0700, Sean Christopherson wrote:
> > > Would this qualify for the smoke test pass? I'll continue to look for a
> > > secure boot install if this doesn't close it, just haven't landed at the
> > > right spot yet.
> >
> > Good enough for me, thanks much!
>
> Thanks a ton Sean.. if anything you have now scared me there is life for
> SMM afterall even in a guest :-)

LOL, you and me both. Thankfully GCE's implementation of Secure Boot for
VMs doesn't utilize SMM, so to a large extent I can close my eyes and plug
my ears :-)