From patchwork Fri Oct 14 20:09:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ashok Raj X-Patchwork-Id: 2801 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:4ac7:0:0:0:0:0 with SMTP id y7csp357422wrs; Fri, 14 Oct 2022 13:10:33 -0700 (PDT) X-Google-Smtp-Source: AMsMyM6B3uEJO3UL2IB3Kf2dU5ORtwaUB4N8NBYbxIoqcUSjhDUj59LyyxZoi1dIZ0nTTfLLAS8L X-Received: by 2002:a17:90b:4a0b:b0:20c:5ac8:1e2c with SMTP id kk11-20020a17090b4a0b00b0020c5ac81e2cmr7523955pjb.98.1665778232886; Fri, 14 Oct 2022 13:10:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1665778232; cv=none; d=google.com; s=arc-20160816; b=oOKLv+7l9ftj0MsJN8Riem6ICS3i31YHSaBuzxc6FTqMNVQnJt6j8dQY1vhtjs5UPM vRewa3jeS+Jrs9OSuBmxDX9YT0p4xqG5VtZ85V9yKNVf8m/gL+9V4vG0NGZRkNwMdfoN eFoH5BbJhnjDAzcuDrStdQgVie8bQ7xyCkST7SCW4ogKrHHXjj4r9uZOxRxPv4k1eA6Z j1VmUnBTeaUyU2lCQwTTNAjaDVlpzQk57fQ5A8SsP6z7x64RliXekiYsBCpIYWD3iM8H KJ7i520806iF9YCI11cumMIizd3A2YOQYVeU24eZ6iOD049JbSOnp9PfUWmPzKTLnHZG yhng== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=KkOMUOKYGE669kfcJF29gqwtV7PvJU/pFgRWe9PorHs=; b=X5xoVt8d3CrgssQ3Ylce0X7mkdPOc6Lec6VVVGXjv2J+NoaU7rJn986UIvXOC5EW3f cMiaGqHo2cfj1qKP0McuTVmLUu7yN8Ke2WmAFbJViQDBv02BU0EPpINGTvl4mg9hwAJ3 K7qtkrbIPEFJUuAscPZM9e4q+4vW4mbEqObHL4lTqECqTNy1MsKBuq/Y1+ehxvZ/qxye y3NqC87ojiS87TqRwtmlggdDk4JEhiEfN6VHVrSOs6pxqf0tZgPZvWmZP9MhKGCkc8vU ggQ3M9vherPb69rFGd4vGJCxZ2CMJoJD8+8kzNJ9KV01n8dQYbLChotHh3sO6I0Wao1O FUDw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=bc4BqgjQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s5-20020a170902a50500b0017849d0200csi3728076plq.219.2022.10.14.13.10.18; Fri, 14 Oct 2022 13:10:32 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=bc4BqgjQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231153AbiJNUJw (ORCPT + 99 others); Fri, 14 Oct 2022 16:09:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43398 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230292AbiJNUJa (ORCPT ); Fri, 14 Oct 2022 16:09:30 -0400 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0DEE31D81A4 for ; Fri, 14 Oct 2022 13:09:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1665778169; x=1697314169; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=r+EN8Zy5jD9nAbdftcXxE+f3RqPdC2OWZt+eOaH6bY8=; b=bc4BqgjQ9spSPEFA78774MiC0tlPLbUqrAzITJlV0/h69IMBNq3jmEVh SJ5u1pFdpdpB0YI3DTR+R8Z42obnDL4k45rNhPiDO3z6/JCxnnaqirX0l U9QrJD11edf4f5rrEFxD0G/qmE1XxHUgoLpIh1IJSCuU/M8vSQNZFzG5y 1XNRC3byQH+MmVD2LBc8rjc49ADGIvpSAo+oNRbQ/FAXslka8GIAtnEOw SAY6edzvOl2q9ssfE1O1WZ2Phxlx83JXsz1WKE9YKBh2twkvxcLG2lRuB EYw7qHSAENM+ubJBQ64bQBS2NVIziN8UxOzBkh5xPvTyivo1/S3el5MLV g==; X-IronPort-AV: E=McAfee;i="6500,9779,10500"; a="304202158" X-IronPort-AV: E=Sophos;i="5.95,185,1661842800"; d="scan'208";a="304202158" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Oct 2022 13:09:28 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10500"; a="716870151" X-IronPort-AV: E=Sophos;i="5.95,185,1661842800"; d="scan'208";a="716870151" Received: from araj-ucode.jf.intel.com ([10.23.0.19]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Oct 2022 13:09:28 -0700 From: Ashok Raj To: Borislav Petkov , Thomas Gleixner Cc: Tony Luck , Dave Hansen , LKML Mailing List , X86-kernel , Tom Lendacky , Arjan van de Ven , Jacob Jun Pan , Ashok Raj Subject: [PATCH 05/13] x86/microcode: Place siblings in NMI loop while update in progress Date: Fri, 14 Oct 2022 13:09:05 -0700 Message-Id: <20221014200913.14644-6-ashok.raj@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221014200913.14644-1-ashok.raj@intel.com> References: <20221014200913.14644-1-ashok.raj@intel.com> MIME-Version: 1.0 X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1746695076394093283?= X-GMAIL-MSGID: =?utf-8?q?1746695076394093283?= Microcode updates affect the state of the running CPU. In the case of hyperthreads, the thread initiating the update is in a known state (performing wrmsr 0x79), but its HT sibling can be executing arbitrary instructions. If one of these arbitrary instruction is being patched by the update at the same time the sibling is trying to execute from it, its using microcode in an unstable state. Ensuring a rendezvous of all CPUs using stop_machine() ensures that siblings are not executing any random user space code, and stop_machine() also masks interrupts that can be masked. The ones that can still slip in are the exceptions. They are: NMI entry code and NMI handlers can also execute relatively arbitrary instructions. This is an effort to ensure NMI doesn't slip until the wrmsr has completed. == Solution: NMI prevention during update == Before the stop_machine() rendezvous, an NMI handler is registered. The handler is placed at the beginning of all other handlers. The siblings then kick themselves into NMI by doing a self NMI IPI. The handler does two things: - Informs the primary thread that it has entered the NMI handler. Only after all siblings of a core have entered NMI, the primary proceeds with wrmsr to update microcode. - It spins until the primary CPU has completed the wrmsr and informs the sibling to quit the NMI loop. Also an important thing to remember is the microcode requests for exclusive access to the core before performing an update. This effectively pulls the sibling into microcode control until the wrmsr has released exclusive access. Since the sibling is not executing any instructions while the wrmsr completes, no other exceptions will surface on the sibling CPU. Breakpoints can be another source that can lead do taking exceptions. But on NMI entry, the kernel seems to be save/clear/restore the breakpoint control register (DR7). local_db_save() and local_db_restore(). This effectively eliminates any breakpoints leading the sibling into uncontrolled execution. The algorithm is something like this: After stop_machine() all threads are executing __reload_late() hold_sibling_in_nmi() { if (cpu not do_nmi_trap) return; update sibling reached NMI for primary to continue while (primary not done with update) wait; return; } exc_nmi:IDT() { .... hold_sibling_in_nmi(); ... } __reload_late() { entry_rendezvous(&late_cpus_in); if (this_cpu is first_cpu in the core) wait for siblings to drop in NMI apply_microcode() set completion to release sibling from NMI else set sibling info to drop into NMI send self_IPI(NMI_VECTOR); wait_for_siblings: exit_rendezvous(&late_cpus_out); } reload_late() { register_nmi_handler() stop_machine(__reload_late); unregister_nmi_handler(); } Suggested-by: Thomas Gleixner Signed-off-by: Ashok Raj --- arch/x86/include/asm/microcode.h | 33 ++++++++ arch/x86/kernel/cpu/microcode/core.c | 101 ++++++++++++++++++++++++- arch/x86/kernel/cpu/microcode/nmi.c | 72 ++++++++++++++++++ arch/x86/kernel/nmi.c | 7 ++ arch/x86/kernel/cpu/microcode/Makefile | 1 + 5 files changed, 210 insertions(+), 4 deletions(-) create mode 100644 arch/x86/kernel/cpu/microcode/nmi.c diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h index 7f7800e15ed0..23d25ef546a5 100644 --- a/arch/x86/include/asm/microcode.h +++ b/arch/x86/include/asm/microcode.h @@ -6,6 +6,33 @@ #include #include +/* + * Although this is a per-cpu structure, both the primary and siblings use + * only the primary structure to communicate. + * + * num_core_cpus - Number of CPUs in the core + * callin - Siblings set to inform primary on NMI arrival + * core_done - Set by primary once microcode update has completed + */ +struct core_rendez { + int num_core_cpus; + atomic_t callin; + atomic_t core_done; + atomic_t failed; +}; + +DECLARE_PER_CPU(struct core_rendez, core_sync); + +/* + * The following structure is only used by secondary. + * Sets the primary per_cpu variable to be found inside the NMI handler to + * indicate this CPU is supposed to drop into NMI. Its consulted in the + * NMI handler before entring the loop waiting for primary to finish the + * loading process. Once loading is complete the NMI handler clears this + * pointer. + */ +DECLARE_PER_CPU(struct core_rendez *, nmi_primary_ptr); + struct ucode_patch { struct list_head plist; void *data; /* Intel uses only this one */ @@ -136,4 +163,10 @@ static inline void reload_early_microcode(void) { } static inline void microcode_bsp_resume(void) { } #endif +#ifdef CONFIG_MICROCODE_LATE_LOADING +extern void hold_sibling_in_nmi(void); +#else +static inline void hold_sibling_in_nmi(void) { } +#endif + #endif /* _ASM_X86_MICROCODE_H */ diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index e4135b4fdbc6..50652b019233 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -39,6 +39,8 @@ #include #include #include +#include +#include #define DRIVER_VERSION "2.2" @@ -432,6 +434,59 @@ static int __wait_for_cpus(atomic_t *t, long long timeout) return 0; } +/* + * This simply ensures that the self IPI with NMI to siblings is marked as + * handled. + */ +static int ucode_nmi_cb(unsigned int val, struct pt_regs *regs) +{ + return NMI_HANDLED; +} + +/* + * Primary thread waits for all siblings to report that they have enterered + * the NMI handler + */ +static int __wait_for_core_siblings(struct core_rendez *rendez) +{ + int num_sibs = rendez->num_core_cpus - 1; + unsigned long long timeout = NSEC_PER_MSEC; + atomic_t *t = &rendez->callin; + int cpu = smp_processor_id(); + + while (atomic_read(t) < num_sibs) { + cpu_relax(); + ndelay(SPINUNIT); + touch_nmi_watchdog(); + timeout -= SPINUNIT; + if (timeout < SPINUNIT) { + pr_err("CPU%d timedout waiting for siblings\n", cpu); + atomic_inc(&rendez->failed); + return 1; + } + } + return 0; +} + +static void prepare_for_nmi(void) +{ + int cpu, first_cpu; + struct core_rendez *pcpu_core; + + for_each_online_cpu(cpu) { + first_cpu = cpumask_first(topology_sibling_cpumask(cpu)); + if (cpu != first_cpu) + continue; + + pcpu_core = &per_cpu(core_sync, first_cpu); + pcpu_core->num_core_cpus = + cpumask_weight(topology_sibling_cpumask(cpu)); + atomic_set(&pcpu_core->callin, 0); + atomic_set(&pcpu_core->core_done, 0); + atomic_set(&pcpu_core->failed, 0); + } +} + /* * Returns: * < 0 - on error @@ -439,14 +494,15 @@ static int __wait_for_cpus(atomic_t *t, long long timeout) */ static int __reload_late(void *info) { - int cpu = smp_processor_id(); + int first_cpu, cpu = smp_processor_id(); + struct core_rendez *pcpu_core; enum ucode_state err; int ret = 0; /* * Wait for all CPUs to arrive. A load will not be attempted unless all * CPUs show up. - * */ + */ if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC)) return -1; @@ -457,10 +513,32 @@ static int __reload_late(void *info) * loading attempts happen on multiple threads of an SMT core. See * below. */ - if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu) + first_cpu = cpumask_first(topology_sibling_cpumask(cpu)); + pcpu_core = &per_cpu(core_sync, first_cpu); + + /* + * Set the CPUs that we should hold in NMI until the primary has + * completed the microcode update. + */ + if (first_cpu == cpu) { + /* + * Wait for all siblings to enter + * NMI before performing the update + */ + ret = __wait_for_core_siblings(pcpu_core); + if (ret || atomic_read(&pcpu_core->failed)) { + pr_err("CPU %d core lead timeout waiting for siblings\n", cpu); + ret = -1; + } + pr_debug("Primary CPU %d proceeding with update\n", cpu); apply_microcode_local(&err); - else + atomic_set(&pcpu_core->core_done, 1); + } else { + /* We set the per-cpu of sibling in this case */ + this_cpu_write(nmi_primary_ptr, pcpu_core); + apic->send_IPI_self(NMI_VECTOR); goto wait_for_siblings; + } if (err >= UCODE_NFOUND) { if (err == UCODE_ERROR) @@ -496,16 +574,31 @@ static int microcode_reload_late(void) pr_err("Attempting late microcode loading - it is dangerous and taints the kernel.\n"); pr_err("You should switch to early loading, if possible.\n"); + /* + * Used for late_load entry and exit rendezvous + */ atomic_set(&late_cpus_in, 0); atomic_set(&late_cpus_out, 0); + prepare_for_nmi(); + + ret = register_nmi_handler(NMI_LOCAL, ucode_nmi_cb, NMI_FLAG_FIRST, + "ucode_nmi"); + if (ret) { + pr_err("Unable to register NMI handler\n"); + goto done; + } + ret = stop_machine_cpuslocked(__reload_late, NULL, cpu_online_mask); if (ret == 0) microcode_check(); + unregister_nmi_handler(NMI_LOCAL, "ucode_nmi"); + pr_info("Reload completed, microcode revision: 0x%x -> 0x%x\n", old, boot_cpu_data.microcode); +done: return ret; } diff --git a/arch/x86/kernel/cpu/microcode/nmi.c b/arch/x86/kernel/cpu/microcode/nmi.c new file mode 100644 index 000000000000..9245dd1e8c9d --- /dev/null +++ b/arch/x86/kernel/cpu/microcode/nmi.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022 Ashok Raj + * + * X86 CPU microcode update NMI handler. + * + */ + +#include +#include +#include +#include + +#include + +#define SPINUNIT 100 /* 100 nsec */ + +DEFINE_PER_CPU(struct core_rendez, core_sync); +DEFINE_PER_CPU(struct core_rendez *, nmi_primary_ptr); + +#define SPINUNIT 100 /* 100 nsec */ + +static void delay(int ms) +{ + unsigned long timeout = jiffies + ((ms * HZ) / 1000); + + while (time_before(jiffies, timeout)) + cpu_relax(); +} + +/* + * Siblings wait until microcode update is completed by the primary thread. + */ +static int __wait_for_update(atomic_t *t) +{ + unsigned long long timeout = NSEC_PER_MSEC; + + while (!arch_atomic_read(t)) { + cpu_relax(); + delay(1); + //touch_nmi_watchdog(); + timeout -= SPINUNIT; + if (timeout < SPINUNIT) + return 1; + } + return 0; +} + +noinstr void hold_sibling_in_nmi(void) +{ + struct core_rendez *pcpu_core; + int ret = 0; + + pcpu_core = this_cpu_read(nmi_primary_ptr); + if (likely(!pcpu_core)) + return; + + /* + * Increment the callin to inform primary thread that the sibling + * has arrived and parked in the NMI handler + */ + arch_atomic_inc(&pcpu_core->callin); + + ret = __wait_for_update(&pcpu_core->core_done); + if (ret) + atomic_inc(&pcpu_core->failed); + + /* + * Clear the nmi_trap, so future NMI's won't be affected + */ + this_cpu_write(nmi_primary_ptr, NULL); +} diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index cec0bfa3bc04..619afeaef07c 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -505,6 +506,12 @@ DEFINE_IDTENTRY_RAW(exc_nmi) this_cpu_write(nmi_dr7, local_db_save()); + /* + * If microcodeupdate is in progress, check and hold the sibling in + * the NMI until primary has completed the update + */ + hold_sibling_in_nmi(); + irq_state = irqentry_nmi_enter(regs); inc_irq_stat(__nmi_count); diff --git a/arch/x86/kernel/cpu/microcode/Makefile b/arch/x86/kernel/cpu/microcode/Makefile index 34098d48c48f..e469990bba73 100644 --- a/arch/x86/kernel/cpu/microcode/Makefile +++ b/arch/x86/kernel/cpu/microcode/Makefile @@ -3,3 +3,4 @@ microcode-y := core.o obj-$(CONFIG_MICROCODE) += microcode.o microcode-$(CONFIG_MICROCODE_INTEL) += intel.o microcode-$(CONFIG_MICROCODE_AMD) += amd.o +microcode-$(CONFIG_MICROCODE_LATE_LOADING) += nmi.o