Message ID | 20230613121615.639116359@linutronix.de |
---|---|
State | New |
Headers | show |
Series | x86/smp: Cure stop_other_cpus() and kexec() troubles |
Commit Message
Thomas Gleixner
June 13, 2023, 12:17 p.m. UTC
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
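For illustration, a guarded readout would first check the highest supported extended leaf. A minimal sketch using the kernel's cpuid_eax() helper (the placement inside stop_this_cpu() and the exact form are assumptions here, not the patch itself):

	/*
	 * Sketch, not the patch: trust leaf 0x8000001f only if the CPU
	 * actually reports it. cpuid_eax(0x80000000) returns the highest
	 * supported extended leaf number.
	 */
	if (cpuid_eax(0x80000000) >= 0x8000001f) {
		/* Leaf 0x8000001f, EAX bit 0 advertises SME. */
		if (cpuid_eax(0x8000001f) & BIT(0))
			native_wbinvd();
	}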
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
    CPU0                                    CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);                       stop_this_cpu()
   while (num_online_cpus() > 1);             set_online(false);
   proceed... -> hang
                                              wbinvd()
WBINVD is an expensive operation, and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 has already observed num_online_cpus() going down to 1 and
proceeds, which causes the system to hang.
Make this more robust by adding a counter which is set to the number of
online CPUs before sending the IPIs and decremented in stop_this_cpu()
after the WBINVD has completed. Check that counter in stop_other_cpus()
instead of watching num_online_cpus().
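Condensed, the resulting handshake looks like this (a sketch assembled from the diff below; the names and helpers are the patch's own):

	/* CPU0 in native_stop_other_cpus(), before sending the IPIs: */
	atomic_set(&stop_cpus_count, num_online_cpus() - 1);

	/* Each stopped CPU in stop_this_cpu(), after the WBINVD: */
	atomic_dec(&stop_cpus_count);

	/* CPU0 waits on the counter instead of num_online_cpus(): */
	while (atomic_read(&stop_cpus_count) > 0 && timeout--)
		udelay(1);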
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
arch/x86/include/asm/cpu.h |  2 ++
arch/x86/kernel/process.c  | 10 ++++++++++
arch/x86/kernel/smp.c      | 15 ++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
Comments
Hi Thomas,

On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbinvd() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
>
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
>
>     CPU0                                    CPU1
>
>  stop_other_cpus()
>    send_IPIs(REBOOT);                       stop_this_cpu()
>    while (num_online_cpus() > 1);             set_online(false);
>    proceed... -> hang
>                                              wbinvd()
>
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.

Is this situation similar to what happened with the unexpected wakeups from
mwait_play_dead()?

i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
kernel marches past the wbinvd() instruction since we didn't wait to ensure
this has indeed completed?

native_machine_halt()
{
	machine_shutdown()->stop_other_cpus()
	stop_this_cpu();  <---- Unbalanced atomic_dec()?
}

But the last stop_this_cpu() in native_machine_halt() would make the count
go negative? Maybe that's OK since no one is waiting for it to go zero at
that point?

> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
>
> Make this more robust by adding a counter which is set to the number of
> online CPUs before sending the IPIs and decremented in stop_this_cpu()
> after the WBINVD completed. Check for that counter in stop_other_cpus()
> instead of watching num_online_cpus().
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@cybernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
>  arch/x86/include/asm/cpu.h |  2 ++
>  arch/x86/kernel/process.c  | 10 ++++++++++
>  arch/x86/kernel/smp.c      | 15 ++++++++++++---
>  3 files changed, 24 insertions(+), 3 deletions(-)
>
>
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
>  int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
>  int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>
> +extern atomic_t stop_cpus_count;
> +
>  #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
>  }
>  #endif
>
> +atomic_t stop_cpus_count;
> +
>  void __noreturn stop_this_cpu(void *dummy)
>  {
>  	local_irq_disable();
> @@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
>  	 */
>  	if (cpuid_eax(0x8000001f) & BIT(0))
>  		native_wbinvd();
> +
> +	/*
> +	 * native_stop_other_cpus() will write to @stop_cpus_count after
> +	 * observing that it went down to zero, which will invalidate the
> +	 * cacheline on this CPU.
> +	 */
> +	atomic_dec(&stop_cpus_count);
> +
>  	for (;;) {
>  		/*
>  		 * Use native_halt() so that memory contents don't change
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/proto.h>
>  #include <asm/apic.h>
> +#include <asm/cpu.h>
>  #include <asm/idtentry.h>
>  #include <asm/nmi.h>
>  #include <asm/mce.h>
> @@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
>  	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
>  		return;
>
> +	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +
>  	/* sync above data before sending IRQ */
>  	wmb();
>
> @@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
>  		 * CPUs reach shutdown state.
>  		 */
>  		timeout = USEC_PER_SEC;
> -		while (num_online_cpus() > 1 && timeout--)
> +		while (atomic_read(&stop_cpus_count) > 0 && timeout--)
>  			udelay(1);
>  	}
>
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (atomic_read(&stop_cpus_count) > 0) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
> @@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
>  		 * one or more CPUs do not reach shutdown state.
>  		 */
>  		timeout = USEC_PER_MSEC * 10;
> -		while (num_online_cpus() > 1 && (wait || timeout--))
> +		while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>  			udelay(1);
>  	}

If we go down the INIT path, life is less complicated..

After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
Won't this completely update the atomic_dec() since CPUs in hlt() will also
take the NMI correct? I'm not sure if this is problematic.

Or should we reinitialize stop_cpus_count before the NMI hurrah?

> @@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
>  	disable_local_APIC();
>  	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>  	local_irq_restore(flags);
> +
> +	/*
> +	 * Ensure that the cache line is invalidated on the other CPUs. See
> +	 * comment vs. SME in stop_this_cpu().
> +	 */
> +	atomic_set(&stop_cpus_count, INT_MAX);

Didn't understand why INT_MAX here?
On Wed, Jun 14 2023 at 12:42, Ashok Raj wrote:
> On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>
> Is this situation similar to what happened with the unexpected wakeups from
> mwait_play_dead()?
>
> i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
> kernel marches past the wbinvd() instruction since we didn't wait to ensure
> this has indeed completed?

This is about reboot and I don't know how wbinvd() interacts with that.
But yes, this could be an issue vs. kexec() too.

> native_machine_halt()
> {
> 	machine_shutdown()->stop_other_cpus()
> 	stop_this_cpu();  <---- Unbalanced atomic_dec()?
> }
>
> But the last stop_this_cpu() in native_machine_halt() would
> make the count go negative? Maybe that's OK since no one is waiting for it
> to go zero at that point?

Correct, it's the CPU which stopped the others.

>>  	timeout = USEC_PER_MSEC * 10;
>> -	while (num_online_cpus() > 1 && (wait || timeout--))
>> +	while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>>  		udelay(1);
>>  	}
>
> If we go down the INIT path, life is less complicated..
>
> After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> Won't this completely update the atomic_dec() since CPUs in hlt() will also
> take the NMI correct? I'm not sure if this is problematic.
>
> Or should we reinitialize stop_cpus_count before the NMI hurrah?

Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!

>> +	/*
>> +	 * Ensure that the cache line is invalidated on the other CPUs. See
>> +	 * comment vs. SME in stop_this_cpu().
>> +	 */
>> +	atomic_set(&stop_cpus_count, INT_MAX);
>
> Didn't understand why INT_MAX here?

Any random number will do. The only purpose is to ensure that there is no
dirty cache line on the other (stopped) CPUs.

Now let me look into this NMI cruft.

Thanks,

        tglx
On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
> > If we go down the INIT path, life is less complicated..
> >
> > After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> > Won't this completely update the atomic_dec() since CPUs in hlt() will also
> > take the NMI correct? I'm not sure if this is problematic.
> >
> > Or should we reinitialize stop_cpus_count before the NMI hurrah
>
> Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!
>
> >> +	/*
> >> +	 * Ensure that the cache line is invalidated on the other CPUs. See
> >> +	 * comment vs. SME in stop_this_cpu().
> >> +	 */
> >> +	atomic_set(&stop_cpus_count, INT_MAX);
> >
> > Didn't understand why INT_MAX here?
>
> Any random number will do. The only purpose is to ensure that there is
> no dirty cache line on the other (stopped) CPUs.
>
> Now let me look into this NMI cruft.

Maybe if each CPU going down can set their mask, we can simply hit NMI to
only the problematic ones?

The simple count doesn't capture the CPUs in trouble.
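For illustration, one hypothetical shape of that suggestion, using a cpumask instead of a bare count so the NMI fallback can target only the stragglers (names and placement here are invented for the sketch, not from the patch):

	/* Hypothetical sketch: track the not-yet-stopped CPUs in a mask. */
	static struct cpumask cpus_stop_mask;

	/* native_stop_other_cpus(): mark everyone but the stopping CPU. */
	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
	cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);

	/* stop_this_cpu(): each CPU clears its own bit after the WBINVD. */
	cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);

	/* The NMI fallback then only hits CPUs which did not check in: */
	for_each_cpu(cpu, &cpus_stop_mask)
		apic->send_IPI(cpu, NMI_VECTOR);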
On Wed, Jun 14 2023 at 13:47, Ashok Raj wrote:
> On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
>>
>> Now let me look into this NMI cruft.
>>
>
> Maybe if each CPU going down can set their mask, we can simply hit NMI to
> only the problematic ones?
>
> The simple count doesn't capture the CPUs in trouble.

Even a mask is not cutting it. If CPUs did not react on the reboot vector
then there is no guarantee that they are not going to do so concurrently
to the NMI IPI:

CPU0                                    CPU1

IPI(BROADCAST, REBOOT);
wait()
  // timeout                            stop_this_cpu()

if (!all_stopped()) {
    for_each_cpu(cpu, mask) {               mark_stopped(); <- all_stopped() == true now
        IPI(cpu, NMI);
    }                                   --> NMI()

// no wait() because all_stopped() == true

proceed_and_hope()
....

On bare metal this is likely to "work" by chance, but in a guest all bets
are off.

I'm not surprised at all. The approach of piling hardware and firmware
legacy on top of hardware and firmware legacy in the hope that we can
"fix" that in software was wrong from the very beginning.

What's surprising is that this worked for a really long time. Though with
increasing complexity the thereby produced debris is starting to rear its
ugly head.

I'm sure the marketing departments of _all_ x86 vendors will come up with
a brilliant slogan for that. Something like:

  "We are committed to ensure that you are able to experience the failures
   of the past forever with increasingly improved performance and new
   exciting features which are fully backwards failure compatible."

TBH, the (OS) software industry has proliferated that by joining the
'features first' choir without much thought and push back. See
arch/x86/kernel/cpu/* for prime examples.

Ranted enough. I'm going to sleep now and look at this mess tomorrow
morning with brain awake again. Though that will not change the underlying
problem, which is unfixable.

Thanks,

        tglx
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern atomic_t stop_cpus_count;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+atomic_t stop_cpus_count;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
 	local_irq_disable();
@@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * native_stop_other_cpus() will write to @stop_cpus_count after
+	 * observing that it went down to zero, which will invalidate the
+	 * cacheline on this CPU.
+	 */
+	atomic_dec(&stop_cpus_count);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
 		return;
 
+	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+
 	/* sync above data before sending IRQ */
 	wmb();
 
@@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (atomic_read(&stop_cpus_count) > 0 && timeout--)
 			udelay(1);
 	}
 
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (atomic_read(&stop_cpus_count) > 0) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
@@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cache line is invalidated on the other CPUs. See
+	 * comment vs. SME in stop_this_cpu().
+	 */
+	atomic_set(&stop_cpus_count, INT_MAX);
 }
 
 /*