From patchwork Thu Jun 15 20:33:57 2023
X-Patchwork-Id: 108715
Message-ID: <20230615193330.492257119@linutronix.de>
From: Thomas Gleixner
To: LKML
Cc: x86@kernel.org, Mario Limonciello, Tom Lendacky, Tony Battersby, Ashok Raj, Tony Luck, Arjan van de Veen, Eric Biederman
Subject: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage
References: <20230615190036.898273129@linutronix.de>
Date: Thu, 15 Jun 2023 22:33:57 +0200 (CEST)

TLDR: It's a mess.

When kexec() is executed on a system with "offline" CPUs which are parked
in mwait_play_dead(), it can end up in a triple fault during the bootup of
the kexec kernel or cause hard-to-diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernel's data.

Cure this by bringing the offline CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offline CPUs to issue
HLT, which does not have the MWAIT problem.

That does not help if a stray NMI, MCE or SMI hits the offline CPUs, as
those make them come out of HLT.

A follow-up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj
Signed-off-by: Thomas Gleixner
Tested-by: Ashok Raj
Reviewed-by: Ashok Raj
---
 arch/x86/include/asm/smp.h |    2 +
 arch/x86/kernel/smp.c      |    5 +++
 arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -157,6 +158,10 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
 		return;
 
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
+
 	/*
 	 * 1) Send an IPI on the reboot vector to all other CPUs.
 	 *
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
 	unsigned int	status;
 };
 
+#define CPUDEAD_MWAIT_WAIT	0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT	0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
 {
 	int cpuid;
 
+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();
 
 	while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);
 
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while (1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }
 
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait up to 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)
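
For readers who want to see the control/status handshake between
mwait_play_dead() and smp_kick_mwait_play_dead() in isolation, here is a
minimal single-threaded userspace model. It is a sketch under stated
assumptions, not kernel code: struct sim_cpu, sim_park(), sim_tick() and
sim_kick() are hypothetical names invented for this illustration, and one
call to sim_tick() stands in for one udelay(5) interval during which the
parked CPU may or may not have reacted to the write. Only the two magic
values and the 1000 x 5us = 5ms retry budget are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Magic values taken from the patch. */
#define CPUDEAD_MWAIT_WAIT      0xDEADBEEFu
#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEADu

static_assert(CPUDEAD_MWAIT_WAIT != CPUDEAD_MWAIT_KEXEC_HLT,
	      "the two control words must be distinguishable");

/* Hypothetical model of one CPU; not a kernel structure. */
enum sim_state { SIM_ONLINE, SIM_IN_MWAIT, SIM_IN_HLT };

struct sim_cpu {
	uint32_t control;		/* written by the kicking CPU */
	uint32_t status;		/* written by the parked CPU */
	enum sim_state state;
	unsigned int wake_latency;	/* ticks before the CPU reacts */
};

/* Models the setup done in mwait_play_dead() before entering MWAIT. */
static void sim_park(struct sim_cpu *c, unsigned int wake_latency)
{
	c->status = CPUDEAD_MWAIT_WAIT;
	c->control = CPUDEAD_MWAIT_WAIT;
	c->state = SIM_IN_MWAIT;
	c->wake_latency = wake_latency;
}

/* One simulated 5us interval as seen by the parked CPU. */
static void sim_tick(struct sim_cpu *c)
{
	if (c->state != SIM_IN_MWAIT)
		return;
	if (c->control != CPUDEAD_MWAIT_KEXEC_HLT)
		return;			/* nothing written yet, stay in MWAIT */
	if (c->wake_latency > 0) {
		c->wake_latency--;	/* woken up, but has not acked yet */
		return;
	}
	/* Acknowledge; the real code then loops on native_halt(). */
	c->status = CPUDEAD_MWAIT_KEXEC_HLT;
	c->state = SIM_IN_HLT;
}

/*
 * Models the per-CPU body of smp_kick_mwait_play_dead().
 * Returns 0 if the CPU was not parked, 1 if it reached HLT,
 * -1 if it did not acknowledge within 1000 ticks (5ms).
 */
static int sim_kick(struct sim_cpu *c)
{
	const uint32_t newstate = CPUDEAD_MWAIT_KEXEC_HLT;
	unsigned int i;

	if (c->status != CPUDEAD_MWAIT_WAIT)
		return 0;

	for (i = 0; c->status != newstate && i < 1000; i++) {
		c->control = newstate;	/* write to the monitored line */
		sim_tick(c);		/* stands in for udelay(5) */
	}
	return c->status == newstate ? 1 : -1;
}
```

Calling sim_park(&cpu, 3) followed by sim_kick(&cpu) models a CPU that
acknowledges after a few retries, while a wake_latency of 1000 or more
exhausts the retry budget, which corresponds to the "stuck in
mwait_play_dead()" message in the patch. The model deliberately ignores
memory ordering and real MONITOR/MWAIT semantics; it only shows why the
kicker rewrites the control word on every retry and checks the status word
to decide whether the CPU has left MWAIT.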