[-v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
Message ID | Y71TglxSLJKO17SY@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:4e01:0:0:0:0:0 with SMTP id p1csp2704512wrt; Tue, 10 Jan 2023 04:04:12 -0800 (PST) X-Google-Smtp-Source: AMrXdXtNdy+u6FUohr5y79urtxRT9n96Ls68LyeG7DPfWEnJE+dnM00OO/DARZ7Q8MbZ9XGxGksM X-Received: by 2002:a17:903:234b:b0:192:9f2d:d6b9 with SMTP id c11-20020a170903234b00b001929f2dd6b9mr58119635plh.9.1673352252300; Tue, 10 Jan 2023 04:04:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1673352252; cv=none; d=google.com; s=arc-20160816; b=cIPoMXDgyN4g/TH0wFrDe1uYgu6Mf/QUNTO8gO7l4LXDGTwH912Nwgrl3B5LKmSanK ncBjsvI6f0ChcWomTTXPTOOgiPawmE1C5cajFNlV3EEIpUB5teBVuN0RGTcNV8N0fgIa 5s6Xhxqq0PpkceAWJLjm1gWGBiXFCFxSac1ohnx+c/H0gX4tTbxnyXiYItfyd6wpu7YP 4OvanR9bJ4isbyFXIyD06R5hKCBa3B5M53AcPWkb5ywy+gSQmgXQGOJVUyZlg6acbjTb wYUM6KI7MSH2eTa9Kk2Dtxsa+lnedgoSmjji+fn1+Gqy9aGj/RdtrhMgk73wqpywqBgr 7G8w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:sender:dkim-signature; bh=VolYRiSABhDndYYNEKONj1lyAiYPsl2ohB5ly0YsFTQ=; b=ediyE2VQZYY7bWfLGobpVonCWi6P5HnIpxtR50DXW/2N1NfUL5MvUQt9dWXWObm8RJ NVAWrY3pcHnBt3K8Mu1nh8uhGapKmLOmKPCQosdZqgozWCzumlv7+Sme/cIh0Km3dgiU DrXQ98CyiL3pLCu/KmFxyJQCD3J6hSdBYXRqh2m02g9kAWxNOddSj8lht8jMfnDgKhaP ve961hN9OhEJ5jx+kYGtnIJHqfQYd9pVGP4MC/7dt/FZPVjLxxGxfssc8Jnzytx9fpJf BWs2iaZtbldXH9HCXoGNEf2+5Nguip1+2LgKJ8Ji69LW/cI4InJj35btOlNUqeWWKEpA 0Piw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=m+YDO5LI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d7-20020a170903230700b00174c5fdc8d2si12264201plh.307.2023.01.10.04.03.58; Tue, 10 Jan 2023 04:04:12 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=m+YDO5LI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232482AbjAJMBO (ORCPT <rfc822;syz17693488234@gmail.com> + 99 others); Tue, 10 Jan 2023 07:01:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48664 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231733AbjAJMBM (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 10 Jan 2023 07:01:12 -0500 Received: from mail-wr1-x42a.google.com (mail-wr1-x42a.google.com [IPv6:2a00:1450:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC03458327 for <linux-kernel@vger.kernel.org>; Tue, 10 Jan 2023 04:01:10 -0800 (PST) Received: by mail-wr1-x42a.google.com with SMTP id d17so11473979wrs.2 for <linux-kernel@vger.kernel.org>; Tue, 10 Jan 2023 04:01:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:from:to:cc:subject:date:message-id :reply-to; bh=VolYRiSABhDndYYNEKONj1lyAiYPsl2ohB5ly0YsFTQ=; b=m+YDO5LITD9TyN0U8OC2I9QvVqvhTGruhVtt8aHGC8HyOF4qKI3fZM3F1HJ/d4qX3p CzqtvXpM3pBRhAF2smMu0K/qFgFr1dcv3bVhE2tp5DASWMNLVHId316rh9YBnXrLmGfc 7RcL+Azp35ao2sCYy/VYdLqDgHCYktElvUOkqjjR/qy8ZSXhOI1JnuVlh/zyTNFytga4 QMlSWzu7H+ZLn6b9g0QOSO/HFuvi7mj4VjQczmyz3w2yeImvIb68VltKLWyMDKhN2jQn H5X/RpombrCVb7UkWMdFKZsR4DXQsIOMWIN96aR9nCADk/4eRd/EaNX7WvmscQxkGpzV oajg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:sender:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=VolYRiSABhDndYYNEKONj1lyAiYPsl2ohB5ly0YsFTQ=; b=iGN+bHpsLCDOlB2fOfnCYV3xfys75LjyEwhWuDfCmTiF42TpLiXQcBwXggnKq2DWDo xmLdedo+wUK2myDrJ9ac2x1gE1OwTLyuu6NDq1XXORVSRsHIELCxSRnoVBdk5LwUfKUS ZSVDDKiL1Ws2sLrWzWUUwJuQ15YgfVsoVbUPBdDGSIBTs4GPO5cJ9FGcSHfZgzbc3br1 3dYi2LUjYLbZvc4n/SARUXZJP7TZDB5y244pcIGlVubGZj0hF6IBWtHHGiNT30uBqYfn 1wF2Wq4FiR9qVr+eENS5b55schylo75+5F8HhPTP/F3ArAa/iJNRijXVKMEbHVpu7/k/ 12Nw== X-Gm-Message-State: AFqh2kr6ufG41gyUUQROvhIs+suZzDuOM6O9BkJ6uSZfofshAxjRGnPc 5/TheUgrl1OyCoW85tfVoJY= X-Received: by 2002:adf:fa84:0:b0:28b:ca44:6458 with SMTP id h4-20020adffa84000000b0028bca446458mr27339809wrr.2.1673352069282; Tue, 10 Jan 2023 04:01:09 -0800 (PST) Received: from gmail.com (1F2EF2EB.nat.pool.telekom.hu. [31.46.242.235]) by smtp.gmail.com with ESMTPSA id q4-20020adfdfc4000000b002bc6c180738sm5241873wrn.90.2023.01.10.04.01.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 10 Jan 2023 04:01:08 -0800 (PST) Sender: Ingo Molnar <mingo.kernel.org@gmail.com> Date: Tue, 10 Jan 2023 13:01:06 +0100 From: Ingo Molnar <mingo@kernel.org> To: Zeng Heng <zengheng4@huawei.com> Cc: michael.roth@amd.com, bp@alien8.de, hpa@zytor.com, tglx@linutronix.de, sathyanarayanan.kuppuswamy@linux.intel.com, kirill.shutemov@linux.intel.com, jroedel@suse.de, keescook@chromium.org, mingo@redhat.com, dave.hansen@linux.intel.com, brijesh.singh@amd.com, linux-kernel@vger.kernel.org, x86@kernel.org, liwei391@huawei.com Subject: [PATCH -v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes Message-ID: <Y71TglxSLJKO17SY@gmail.com> References: <20230110102745.2514694-1-zengheng4@huawei.com> <Y71FJ+G0NGQe3Ppq@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <Y71FJ+G0NGQe3Ppq@gmail.com> X-Spam-Status: No, score=-1.7 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_EF,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1754636125633497609?= X-GMAIL-MSGID: =?utf-8?q?1754637011176855630?= |
Series |
[-v2] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes
|
|
Commit Message
Ingo Molnar
Jan. 10, 2023, 12:01 p.m. UTC
* Ingo Molnar <mingo@kernel.org> wrote: > > * Zeng Heng <zengheng4@huawei.com> wrote: > > > +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code) > > +{ > > + /* ignore */ > > +} > > diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c > > index 6debb816e83d..b169c9728d52 100644 > > --- a/arch/x86/boot/compressed/idt_64.c > > +++ b/arch/x86/boot/compressed/idt_64.c > > @@ -60,6 +60,7 @@ void load_stage2_idt(void) > > { > > boot_idt_desc.address = (unsigned long)boot_idt; > > > > + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault); > > set_idt_entry(X86_TRAP_PF, boot_page_fault); > > So it's a bit sad to install a dummy handler that does nothing, while > something clearly sent an NMI and expects an intelligent reaction - OTOH > the unexpected NMIs from from watchdog during a kdump clearly make things > worse by crashing the bootup. > > Anyway, I cannot think of a better response here that the boot loading code > could do either, so I've applied your fix to tip:x86/boot. BTW., the changelog had very poor quality, and the patch added no comments to explain the presence of the dummy NMI. The -v2 version below should address most of those problems. Thanks, Ingo =============> From: Zeng Heng <zengheng4@huawei.com> Date: Tue, 10 Jan 2023 18:27:45 +0800 Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes If kdump is enabled, when using mce_inject to inject errors, EFI boot loader would decompress & load second kernel for saving the vmcore file. For normal errors that is fine. However, in the MCE case, the panic CPU that firstly enters into mce_panic() is running within NMI interrupt context, and the processor blocks delivery of subsequent NMIs until the next execution of the IRET instruction. When the panic CPU takes long time in the panic processing route, and causes the watchdog timeout, at this moment, the processor already receives NMI interrupt in the background. In the reproducer sequence below, panic CPU would run into EFI loader and raise page fault exception (like visiting `vidmem` variable when attempting to call debug_putstr()), the CPU would execute IRET instruction when it exits from the page fault handler. But the loader never registers handler for NMI vector in IDT, lack of vector handler would cause reboot, which interrupts kdump procedure and fails to save the vmcore file. Here is steps to reproduce the above issue (it's sporadic): 1. # cat uncorrected CPU 1 BANK 4 STATUS uncorrected 0xc0 MCGSTATUS EIPV MCIP ADDR 0x1234 RIP 0xdeadbabe RAISINGCPU 0 MCGCAP SER CMCI TES 0x6 2. # modprobe mce_inject 3. # mce-inject uncorrected For increasing the probability of reproduction of this issue, there are two ways to increase the probability of the bug: 1. modify the threshold value of watchdog (increase NMI frequency); 2. and/or add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro; Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table") Signed-off-by: Zeng Heng <zengheng4@huawei.com> [ Tidy up changelog, add comments. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230110102745.2514694-1-zengheng4@huawei.com --- arch/x86/boot/compressed/ident_map_64.c | 12 ++++++++++++ arch/x86/boot/compressed/idt_64.c | 1 + arch/x86/boot/compressed/idt_handlers_64.S | 1 + arch/x86/boot/compressed/misc.h | 1 + 4 files changed, 15 insertions(+)
Comments
On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote: > From: Zeng Heng <zengheng4@huawei.com> > Date: Tue, 10 Jan 2023 18:27:45 +0800 > Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes > > If kdump is enabled, when using mce_inject to inject errors, EFI Why does "EFI" matter here? Any boot loader would do... > boot loader would decompress & load second kernel for saving the s/&/and/ > vmcore file. > > For normal errors that is fine. Useless sentence. > However, in the MCE case, the panic > CPU that firstly enters into mce_panic() is running within NMI > interrupt context, "#MC context" it is non-maskable but that's not "NMI interrupt context" > and the processor blocks delivery of subsequent > NMIs until the next execution of the IRET instruction. > > When the panic CPU takes long time in the panic processing route, I'm still unclear on the order of events here. It sounds like 1. MCE injected 2. panic 3. kdump gets loaded If that is the case, then I presume the flow is: mce_panic -> panic -> __crash_kexec() Yes? If so, then we should make sure we have *exited* #MC context before calling panic() and not have to add hacks like this one of adding an empty NMI handler. But I'm only speculating as it is hard to make sense of all this text. Thx.
On Tue, Jan 10, 2023 at 01:11:29PM +0100, Borislav Petkov wrote: > On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote: > > From: Zeng Heng <zengheng4@huawei.com> > > Date: Tue, 10 Jan 2023 18:27:45 +0800 > > Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes > > > > If kdump is enabled, when using mce_inject to inject errors, EFI > > Why does "EFI" matter here? Any boot loader would do... > > > boot loader would decompress & load second kernel for saving the > > s/&/and/ > > > vmcore file. > > > > For normal errors that is fine. > > Useless sentence. > > > However, in the MCE case, the panic > > CPU that firstly enters into mce_panic() is running within NMI > > interrupt context, > > "#MC context" it is non-maskable but that's not "NMI interrupt context" > > > and the processor blocks delivery of subsequent > > NMIs until the next execution of the IRET instruction. > > > > When the panic CPU takes long time in the panic processing route, > > I'm still unclear on the order of events here. It sounds like > > 1. MCE injected > 2. panic > 3. kdump gets loaded > > If that is the case, then I presume the flow is: > > mce_panic -> panic -> __crash_kexec() > > Yes? > > If so, then we should make sure we have *exited* #MC context before calling > panic() and not have to add hacks like this one of adding an empty NMI handler. > > But I'm only speculating as it is hard to make sense of all this text. IOW, does this help? --- diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7832a69d170e..55437d8a4fad 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) if (panic_timeout == 0) panic_timeout = mca_cfg.panic_timeout; panic(msg); + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); } else pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
On 2023/1/10 20:11, Borislav Petkov wrote: > On Tue, Jan 10, 2023 at 01:01:06PM +0100, Ingo Molnar wrote: >> From: Zeng Heng <zengheng4@huawei.com> >> Date: Tue, 10 Jan 2023 18:27:45 +0800 >> Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes >> >> If kdump is enabled, when using mce_inject to inject errors, EFI > Why does "EFI" matter here? Any boot loader would do... > >> boot loader would decompress & load second kernel for saving the > s/&/and/ > >> vmcore file. >> >> For normal errors that is fine. > Useless sentence. > >> However, in the MCE case, the panic >> CPU that firstly enters into mce_panic() is running within NMI >> interrupt context, > "#MC context" it is non-maskable but that's not "NMI interrupt context" mce is registered on NMI handler by inject_init(). And here is the context of mce-inject: #0 relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55 #1 0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400) at arch/x86/kernel/machine_kexec_64.c:391 #2 0xffffffff811a9573 in __crash_kexec (regs=regs@entry=0x0 <fixed_percpu_data>) at kernel/kexec_core.c:1057 #3 0xffffffff81a5b4e4 in panic (fmt=fmt@entry=0xffffffff823211c8 "Fatal machine check") at kernel/panic.c:393 #4 0xffffffff81aa65f5 in mce_panic ( msg=msg@entry=0xffffffff823211c8 "Fatal machine check", final=final@entry=0xffff88813ac9eec0, exp=<optimized out>) at arch/x86/kernel/cpu/mce/core.c:380 #5 0xffffffff8103863b in mce_reign () at arch/x86/kernel/cpu/mce/core.c:1042 #6 0xffffffff81aa682f in mce_end (order=order@entry=1) at arch/x86/kernel/cpu/mce/core.c:1175 #7 0xffffffff81aa6d57 in do_machine_check (regs=0xfffffe00000beef8) at arch/x86/kernel/cpu/mce/core.c:1567 #8 0xffffffffc0495167 in raise_exception (m=m@entry=0xffff88813ad9ef40, pregs=<optimized out>) at arch/x86/kernel/cpu/mce/inject.c:152 #9 0xffffffffc0495e7f in mce_raise_notify (cmd=<optimized out>, regs=<optimized out>) at arch/x86/kernel/cpu/mce/inject.c:168 #10 0xffffffff810204b8 in nmi_handle () #11 0xffffffff81aa5e62 in default_do_nmi (regs=regs@entry=0xfffffe00000beef8) at arch/x86/kernel/nmi.c:335 #12 0xffffffff81aa608d in exc_nmi (regs=0xfffffe00000beef8) at arch/x86/kernel/nmi.c:517 #13 0xffffffff81c014e8 in asm_exc_nmi () at arch/x86/entry/entry_64.S:1440 > >> and the processor blocks delivery of subsequent >> NMIs until the next execution of the IRET instruction. >> >> When the panic CPU takes long time in the panic processing route, > I'm still unclear on the order of events here. It sounds like > > 1. MCE injected > 2. panic > 3. kdump gets loaded > > If that is the case, then I presume the flow is: > > mce_panic -> panic -> __crash_kexec() > > Yes? Yes, exactly. The following procedure is like: panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image -> EFI loader -> secondary kernel > > If so, then we should make sure we have *exited* #MC context before calling > panic() and not have to add hacks like this one of adding an empty NMI handler. > > But I'm only speculating as it is hard to make sense of all this text. > > Thx. >
* Borislav Petkov <bp@alien8.de> wrote: > > mce_panic -> panic -> __crash_kexec() > > > > Yes? > > > > If so, then we should make sure we have *exited* #MC context before calling > > panic() and not have to add hacks like this one of adding an empty NMI handler. > > > > But I'm only speculating as it is hard to make sense of all this text. > > IOW, does this help? > > --- > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index 7832a69d170e..55437d8a4fad 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) > if (panic_timeout == 0) > panic_timeout = mca_cfg.panic_timeout; > panic(msg); > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); So your suggestion was to exit MC context 'before' the panic() call - but the patch calls it 'after' - was that intentional? Thanks, Ingo
On Tue, Jan 10, 2023 at 01:34:35PM +0100, Ingo Molnar wrote: > So your suggestion was to exit MC context 'before' the panic() call - but > the patch calls it 'after' - was that intentional? Bah, you're right - it should be: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); panic();
On 2023/1/10 20:34, Ingo Molnar wrote: > * Borislav Petkov <bp@alien8.de> wrote: > >>> mce_panic -> panic -> __crash_kexec() >>> >>> Yes? >>> >>> If so, then we should make sure we have *exited* #MC context before calling >>> panic() and not have to add hacks like this one of adding an empty NMI handler. >>> >>> But I'm only speculating as it is hard to make sense of all this text. >> IOW, does this help? >> >> --- >> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c >> index 7832a69d170e..55437d8a4fad 100644 >> --- a/arch/x86/kernel/cpu/mce/core.c >> +++ b/arch/x86/kernel/cpu/mce/core.c >> @@ -287,6 +287,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) >> if (panic_timeout == 0) >> panic_timeout = mca_cfg.panic_timeout; >> panic(msg); >> + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); I'm willing to test any patch provided, but the panic() is never return and the mce_wrmsrl() would be never called. Am I wrong? B.R., Zeng Heng > So your suggestion was to exit MC context 'before' the panic() call - but > the patch calls it 'after' - was that intentional? > > Thanks, > > Ingo
On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote: > mce is registered on NMI handler by inject_init(). That's a handler for the NMI raised by raise_mce(). That's for the injection case, which is simulated. If you're fixing the injection case, then surely not with a bogus boot NMI handler. > Yes, exactly. The following procedure is like: > > panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image -> > EFI loader -> secondary kernel I'm doubtful now as you're injecting errors so you're not really in #MC context but in this contrived context which is actually an NMI one. So we need to think about how to fix this case. Certainly not with an empty NMI handler... Regardless, we should do diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7832a69d170e..57fe376ed049 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -286,6 +286,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) if (!fake_panic) { if (panic_timeout == 0) panic_timeout = mca_cfg.panic_timeout; + + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); panic(msg); } else pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg); so that we not run kexec in #MC context. Hmmm.
On 2023/1/10 20:57, Borislav Petkov wrote: > On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote: >> mce is registered on NMI handler by inject_init(). > That's a handler for the NMI raised by raise_mce(). That's for the injection > case, which is simulated. If you're fixing the injection case, then surely not > with a bogus boot NMI handler. OK, mce-injection is the simulated one. > >> Yes, exactly. The following procedure is like: >> >> panic() -> relocate_kernel() -> identity_mapped() -> x86 purgatory image -> >> EFI loader -> secondary kernel > I'm doubtful now as you're injecting errors so you're not really in #MC context > but in this contrived context which is actually an NMI one. So we need to think > about how to fix this case. > > Certainly not with an empty NMI handler... > > Regardless, we should do > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index 7832a69d170e..57fe376ed049 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -286,6 +286,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) > if (!fake_panic) { > if (panic_timeout == 0) > panic_timeout = mca_cfg.panic_timeout; > + > + mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); > panic(msg); > } else > pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg); > > so that we not run kexec in #MC context. > > Hmmm. I don't have ready test case for real MCE to verify whether it has exited #MC context before panic() or not. In mce-inject case that based on NMI, it doesn't work as mentioned indeed. B.R., Zeng Heng
On 2023/1/10 20:01, Ingo Molnar wrote: > * Ingo Molnar <mingo@kernel.org> wrote: > >> * Zeng Heng <zengheng4@huawei.com> wrote: >> >>> +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code) >>> +{ >>> + /* ignore */ >>> +} >>> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c >>> index 6debb816e83d..b169c9728d52 100644 >>> --- a/arch/x86/boot/compressed/idt_64.c >>> +++ b/arch/x86/boot/compressed/idt_64.c >>> @@ -60,6 +60,7 @@ void load_stage2_idt(void) >>> { >>> boot_idt_desc.address = (unsigned long)boot_idt; >>> >>> + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault); >>> set_idt_entry(X86_TRAP_PF, boot_page_fault); >> So it's a bit sad to install a dummy handler that does nothing, while >> something clearly sent an NMI and expects an intelligent reaction - OTOH >> the unexpected NMIs from from watchdog during a kdump clearly make things >> worse by crashing the bootup. >> >> Anyway, I cannot think of a better response here that the boot loading code >> could do either, so I've applied your fix to tip:x86/boot. > BTW., the changelog had very poor quality, and the patch added no comments > to explain the presence of the dummy NMI. > > The -v2 version below should address most of those problems. > > Thanks, > > Ingo Thanks for your work. B.R., Zeng Heng > =============> > From: Zeng Heng <zengheng4@huawei.com> > Date: Tue, 10 Jan 2023 18:27:45 +0800 > Subject: [PATCH] x86/boot/compressed: Register dummy NMI handler in EFI boot loader, to avoid kdump crashes > > If kdump is enabled, when using mce_inject to inject errors, EFI > boot loader would decompress & load second kernel for saving the > vmcore file. > > For normal errors that is fine. However, in the MCE case, the panic > CPU that firstly enters into mce_panic() is running within NMI > interrupt context, and the processor blocks delivery of subsequent > NMIs until the next execution of the IRET instruction. > > When the panic CPU takes long time in the panic processing route, > and causes the watchdog timeout, at this moment, the processor > already receives NMI interrupt in the background. > > In the reproducer sequence below, panic CPU would run into EFI loader > and raise page fault exception (like visiting `vidmem` variable > when attempting to call debug_putstr()), the CPU would execute IRET > instruction when it exits from the page fault handler. > > But the loader never registers handler for NMI vector in IDT, > lack of vector handler would cause reboot, which interrupts > kdump procedure and fails to save the vmcore file. > > Here is steps to reproduce the above issue (it's sporadic): > > 1. # cat uncorrected > CPU 1 BANK 4 > STATUS uncorrected 0xc0 > MCGSTATUS EIPV MCIP > ADDR 0x1234 > RIP 0xdeadbabe > RAISINGCPU 0 > MCGCAP SER CMCI TES 0x6 > 2. # modprobe mce_inject > 3. # mce-inject uncorrected > > For increasing the probability of reproduction of this issue, there are > two ways to increase the probability of the bug: > > 1. modify the threshold value of watchdog (increase NMI frequency); > 2. and/or add delays before panic() in mce_panic() and modify PANIC_TIMEOUT macro; > > Fixes: ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table") > Signed-off-by: Zeng Heng <zengheng4@huawei.com> > [ Tidy up changelog, add comments. ] > Signed-off-by: Ingo Molnar <mingo@kernel.org> > Link: https://lore.kernel.org/r/20230110102745.2514694-1-zengheng4@huawei.com > --- > arch/x86/boot/compressed/ident_map_64.c | 12 ++++++++++++ > arch/x86/boot/compressed/idt_64.c | 1 + > arch/x86/boot/compressed/idt_handlers_64.S | 1 + > arch/x86/boot/compressed/misc.h | 1 + > 4 files changed, 15 insertions(+) > > diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c > index d4a314cc50d6..cbfdefcf9657 100644 > --- a/arch/x86/boot/compressed/ident_map_64.c > +++ b/arch/x86/boot/compressed/ident_map_64.c > @@ -379,3 +379,15 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code) > */ > kernel_add_identity_map(address, end); > } > + > +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code) > +{ > + /* > + * Default boot loader placeholder fault handler - there's no real > + * kernel running yet, so there's not much we can do - but NMIs > + * can arrive in a kdump scenario, for example by the NMI watchdog. > + * > + * Not having any handler would cause the CPU to silently reboot, > + * so we do the second-worst thing here and ignore the NMI. > + */ > +} > diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c > index 6debb816e83d..b169c9728d52 100644 > --- a/arch/x86/boot/compressed/idt_64.c > +++ b/arch/x86/boot/compressed/idt_64.c > @@ -60,6 +60,7 @@ void load_stage2_idt(void) > { > boot_idt_desc.address = (unsigned long)boot_idt; > > + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault); > set_idt_entry(X86_TRAP_PF, boot_page_fault); > > #ifdef CONFIG_AMD_MEM_ENCRYPT > diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S > index 22890e199f5b..2aef8e1b515b 100644 > --- a/arch/x86/boot/compressed/idt_handlers_64.S > +++ b/arch/x86/boot/compressed/idt_handlers_64.S > @@ -69,6 +69,7 @@ SYM_FUNC_END(\name) > .text > .code64 > > +EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0 > EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1 > > #ifdef CONFIG_AMD_MEM_ENCRYPT > diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h > index 62208ec04ca4..d89d3f8417f6 100644 > --- a/arch/x86/boot/compressed/misc.h > +++ b/arch/x86/boot/compressed/misc.h > @@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { } > #endif > > /* IDT Entry Points */ > +void boot_nmi_fault(void); > void boot_page_fault(void); > void boot_stage1_vc(void); > void boot_stage2_vc(void);
On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote: > And here is the context of mce-inject: > > #0 relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55 > #1 0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400) > at arch/x86/kernel/machine_kexec_64.c:391 Before we continue with this any further: are you doing this "exercise" in qemu/kvm and nothing of that is happening on real hardware?
On Tue, Jan 10, 2023 at 01:57:05PM +0100, Borislav Petkov wrote: > I'm doubtful now as you're injecting errors so you're not really in #MC context > but in this contrived context which is actually an NMI one. So we need to think > about how to fix this case. I did some more thinking: *if* this really is a real issue - and not some silly qemu games - then native_machine_crash_shutdown() does all the cleanup before the kdump kernel is started. Any NMI clearing, maybe using iret_to_self() etc, #MC resetting etc should happen there and not anywhere else.
On 2023/1/10 22:53, Borislav Petkov wrote: > On Tue, Jan 10, 2023 at 08:32:07PM +0800, Zeng Heng wrote: >> And here is the context of mce-inject: >> >> #0 relocate_kernel () at arch/x86/kernel/relocate_kernel_64.S:55 >> #1 0xffffffff81a57fc2 in machine_kexec (image=0xffff888101ef8400) >> at arch/x86/kernel/machine_kexec_64.c:391 > Before we continue with this any further: are you doing this "exercise" in > qemu/kvm and nothing of that is happening on real hardware? Real hardware and QEMU both can reproduce the issue. Here is description about NMI from Intel 64 and IA-32 Architectures Software Developer's Manual in chapter 6.7.1: While an NMI interrupt handler is executing, the processor blocks delivery of subsequent NMIs until the next execution of the IRET instruction. This blocking of NMIs prevents nested execution of the NMI handler. It is recommended that the NMI interrupt handler be accessed through an interrupt gate to disable maskable hardware interrupts (see Section 6.8.1, “Masking Maskable Hardware Interrupts”).
On 2023/1/11 0:09, Borislav Petkov wrote: > On Tue, Jan 10, 2023 at 01:57:05PM +0100, Borislav Petkov wrote: >> I'm doubtful now as you're injecting errors so you're not really in #MC context >> but in this contrived context which is actually an NMI one. So we need to think >> about how to fix this case. > I did some more thinking: > > *if* this really is a real issue - and not some silly qemu games - then > native_machine_crash_shutdown() does all the cleanup before the kdump kernel is > started. > > Any NMI clearing, maybe using iret_to_self() etc, #MC resetting etc should > happen there and not anywhere else. You mean native_machine_crash_shutdown() should cleanup the NMI interrupt status before enter kexec? But how about the watchdog raise NMI interrupt after native_machine_crash_shutdown() or mce_wrmsrl(MSR_IA32_MCG_STATUS, 0) or any else cleanup function? B.R., Zeng Heng
diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c index d4a314cc50d6..cbfdefcf9657 100644 --- a/arch/x86/boot/compressed/ident_map_64.c +++ b/arch/x86/boot/compressed/ident_map_64.c @@ -379,3 +379,15 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code) */ kernel_add_identity_map(address, end); } + +void do_boot_nmi_fault(struct pt_regs *regs, unsigned long error_code) +{ + /* + * Default boot loader placeholder fault handler - there's no real + * kernel running yet, so there's not much we can do - but NMIs + * can arrive in a kdump scenario, for example by the NMI watchdog. + * + * Not having any handler would cause the CPU to silently reboot, + * so we do the second-worst thing here and ignore the NMI. + */ +} diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c index 6debb816e83d..b169c9728d52 100644 --- a/arch/x86/boot/compressed/idt_64.c +++ b/arch/x86/boot/compressed/idt_64.c @@ -60,6 +60,7 @@ void load_stage2_idt(void) { boot_idt_desc.address = (unsigned long)boot_idt; + set_idt_entry(X86_TRAP_NMI, boot_nmi_fault); set_idt_entry(X86_TRAP_PF, boot_page_fault); #ifdef CONFIG_AMD_MEM_ENCRYPT diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S index 22890e199f5b..2aef8e1b515b 100644 --- a/arch/x86/boot/compressed/idt_handlers_64.S +++ b/arch/x86/boot/compressed/idt_handlers_64.S @@ -69,6 +69,7 @@ SYM_FUNC_END(\name) .text .code64 +EXCEPTION_HANDLER boot_nmi_fault do_boot_nmi_fault error_code=0 EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1 #ifdef CONFIG_AMD_MEM_ENCRYPT diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index 62208ec04ca4..d89d3f8417f6 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -187,6 +187,7 @@ static inline void cleanup_exception_handling(void) { } #endif /* IDT Entry Points */ +void boot_nmi_fault(void); void boot_page_fault(void); void boot_stage1_vc(void); void boot_stage2_vc(void);