Message ID | 20230925074426.97856-1-xueshuai@linux.alibaba.com |
---|---|
Headers |
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: keescook@chromium.org, tony.luck@intel.com, gpiccoli@igalia.com, rafael@kernel.org, lenb@kernel.org, james.morse@arm.com, bp@alien8.de, tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, ardb@kernel.org, robert.moore@intel.com
Cc: linux-hardening@vger.kernel.org, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org, linux-efi@vger.kernel.org, acpica-devel@lists.linuxfoundation.org, xueshuai@linux.alibaba.com, baolin.wang@linux.alibaba.com
Subject: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors
Date: Mon, 25 Sep 2023 15:44:17 +0800
Message-Id: <20230925074426.97856-1-xueshuai@linux.alibaba.com>
X-Mailer: git-send-email 2.34.1 |
Series |
Use ERST for persistent storage of MCE and APEI errors
|
|
Message
Shuai Xue
Sept. 25, 2023, 7:44 a.m. UTC
Changes since v1:
- fix a compile warning by dereferencing the rcd pointer before memset
- fix a compile error by adding CONFIG_X86_MCE
- Link: https://lore.kernel.org/all/20230916130316.65815-3-xueshuai@linux.alibaba.com/

In certain scenarios (i.e. hosts/guests with root filesystems on NFS/iSCSI
where networking software and/or hardware fails, and thus kdump fails), it
is necessary to serialize hardware error information so that it is available
for post-mortem debugging. Saving the hardware error log into flash via ERST
before panicking lets the log be retrieved from flash after the system boots
successfully again, which is very useful in production.

On the x86 platform, the kernel has supported serializing and deserializing
MCE error records since commit 482908b49ebf ("ACPI, APEI, Use ERST for
persistent storage of MCE"). The process involves two steps:

- MCE Producer: When a hardware error is detected, an MCE is raised and its
  handler writes the MCE error record into flash via ERST before panic.
- MCE Consumer: After system reboot, /sbin/mcelog runs and reads /dev/mcelog
  to check flash for error records of the previous boot via ERST.

After the /dev/mcelog character device was deprecated by commit 5de97c9f6d85
("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the serialized
MCE error record of the previous boot in persistent storage is not collected
via APEI ERST.

This patch set includes two parts:

- PATCH 1-3: rework apei_{read,write}_mce to use the pstore data structure
  and emit the mce_record tracepoint, enabling the collection of MCE records
  by the rasdaemon tool.
- PATCH 4-9: use ERST for persistent storage of APEI errors, and emit
  tracepoints for CPER sections, enabling the collection of MCE records by
  the rasdaemon tool.
Shuai Xue (9):
  pstore: move pstore creator id, section type and record struct to common header
  ACPI: APEI: Use common ERST struct to read/write serialized MCE record
  ACPI: APEI: ERST: Emit the mce_record tracepoint
  ACPI: tables: change section_type of generic error data as guid_t
  ACPI: APEI: GHES: Use ERST to serialize APEI generic error before panic
  ACPI: APEI: GHES: export ghes_report_chain
  ACPI: APEI: ESRT: kick ghes_report_chain notifier to report serialized memory errors
  ACPI: APEI: ESRT: print AER to report serialized PCIe errors
  ACPI: APEI: ESRT: log ARM processor error

 arch/x86/kernel/cpu/mce/apei.c | 82 +++++++++++++++-------------------
 drivers/acpi/acpi_extlog.c     |  2 +-
 drivers/acpi/apei/erst.c       | 55 ++++++++++++++---------
 drivers/acpi/apei/ghes.c       | 48 +++++++++++++++++++-
 drivers/firmware/efi/cper.c    |  2 +-
 fs/pstore/platform.c           |  3 ++
 include/acpi/actbl1.h          |  5 ++-
 include/acpi/ghes.h            |  2 +-
 include/linux/pstore.h         | 29 ++++++++++++
 9 files changed, 154 insertions(+), 74 deletions(-)
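
The two-step producer/consumer flow described in the cover letter can be
modelled in user space. The following is a minimal sketch, not kernel code:
a JSON file stands in for ERST-backed flash, and FakeErst,
produce_before_panic and consume_after_reboot are hypothetical names chosen
purely for illustration.

```python
import json
import os
import tempfile


class FakeErst:
    """Hypothetical stand-in for ERST flash: a file that survives a 'reboot'."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def _store(self, records):
        with open(self.path, "w") as f:
            json.dump(records, f)

    def write(self, record):
        records = self._load()
        records.append(record)
        self._store(records)

    def read_and_clear(self, creator):
        """Return all records written by `creator` and erase them from the store."""
        records = self._load()
        mine = [r for r in records if r["creator"] == creator]
        self._store([r for r in records if r["creator"] != creator])
        return mine


def produce_before_panic(store, creator, payload):
    """Producer step: persist the error record; the machine would then panic."""
    store.write({"creator": creator, "payload": payload})


def consume_after_reboot(store, creator):
    """Consumer step: on the next boot, collect what the previous boot left."""
    return store.read_and_clear(creator)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "erst.json")
    # Made-up example payload, not a real MCE record.
    produce_before_panic(FakeErst(path), "MCE", {"bank": 4})
    # A fresh FakeErst object models the state after reboot: only the file persists.
    print(consume_after_reboot(FakeErst(path), "MCE"))
    print(consume_after_reboot(FakeErst(path), "MCE"))  # records were cleared
```

Keying records by creator mirrors the idea in patch 1 of sharing a creator id
between producers, so that each consumer only picks up its own records.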
Comments
On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
> serialized MCE error record, of previous boot in persistent storage is not
> collected via APEI ERST.

You lost me here. /dev/mcelog is deprecated but you can still use it and
apei_write_mce() still happens.

Looking at your patches, you're adding this to ghes so how about you sit
down first and explain your exact use case and what exactly you wanna
do?

Thx.
On 2023/9/28 22:43, Borislav Petkov wrote:
> On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
>> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
>> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
>> serialized MCE error record, of previous boot in persistent storage is not
>> collected via APEI ERST.
>
> You lost me here. /dev/mcelog is deprecated but you can still use it and
> apei_write_mce() still happens.

Yes, you are right. apei_write_mce() still happens, so MCE records are
written to persistent storage and can be retrieved by apei_read_mce().
Previously, that task was performed by the mcelog package. However, mcelog
has been deprecated, and some distribution kernels, such as Arch's, are not
even compiled with the necessary configuration option
CONFIG_X86_MCELOG_LEGACY.[1]

So, IMHO, it's better to add a way to retrieve MCE records by switching to
the new generation rasdaemon solution.

> Looking at your patches, you're adding this to ghes so how about you sit
> down first and explain your exact use case and what exactly you wanna
> do?
>
> Thx.

Sorry for the poor cover letter. I hope the following response can clarify
the matter.

Q1: What is the exact problem?

Traditionally, a fatal hardware error causes Linux to print an error log to
the console, e.g. via print_mce() or __ghes_print_estatus(), and then reboot.
With Linux, the primary method for obtaining debugging information about a
serious error or fault is the kdump mechanism. Kdump captures a wealth of
kernel and machine state and writes it to a file for post-mortem debugging.

In certain scenarios, i.e. hosts/guests with root filesystems on NFS/iSCSI
where networking software and/or hardware fails, kdump fails to collect the
hardware error context, leaving us unaware of what actually occurred.

In the public cloud scenario, multiple virtual machines run on a single
physical server, and if that server experiences a failure, it can
potentially impact multiple tenants. It is crucial for us to thoroughly
analyze the root causes of each instance failure in order to:

- Provide customers with a detailed explanation of the outage to reassure them.
- Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
- Explore potential solutions to prevent widespread outages.

In short, it is necessary to serialize hardware error information so that it
is available for post-mortem debugging.

Q2: What exactly I wanna do:

The MCE handler, do_machine_check(), saves the MCE record to persistent
storage and it is retrieved by mcelog. Mcelog was deprecated when kernel
4.12 was released in 2017, and the help text of the configuration option
CONFIG_X86_MCELOG_LEGACY suggests switching to the new generation rasdaemon
solution. The GHES handler does not yet support saving APEI error records.

To serialize hardware error information so that it is available for
post-mortem debugging:

- add support to save the APEI error record into flash via ERST before panic,
- add support to retrieve the MCE or APEI error record from flash and emit
  the related tracepoint after the system boots successfully again, so that
  rasdaemon can collect them.

Best Regards,
Shuai

[1] https://wiki.archlinux.org/title/Machine-check_exception
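
The retrieval half of this plan amounts to a dispatch on record type at boot.
As a rough sketch under stated assumptions: the record kinds and the
replay_records helper below are hypothetical, and the handler descriptions
merely paraphrase the patch titles in this series. Real kernel code would
match on CPER section type GUIDs rather than strings.

```python
def replay_records(records, handlers):
    """Replay records saved by a previous boot through per-type handlers.

    Returns (emitted, skipped): handler outputs, and records with no handler.
    """
    emitted, skipped = [], []
    for rec in records:
        handler = handlers.get(rec["kind"])
        if handler is None:
            skipped.append(rec)  # unknown type: leave it for other consumers
        else:
            emitted.append(handler(rec))
    return emitted, skipped


# Hypothetical mapping; the descriptions paraphrase patches 3, 7, 8 and 9.
HANDLERS = {
    "mce": lambda r: f"emit mce_record tracepoint: {r['data']}",
    "memory": lambda r: f"kick ghes_report_chain notifier: {r['data']}",
    "pcie": lambda r: f"print AER: {r['data']}",
    "arm": lambda r: f"log ARM processor error: {r['data']}",
}

if __name__ == "__main__":
    # Made-up example payloads.
    saved = [
        {"kind": "mce", "data": "bank 4"},
        {"kind": "pcie", "data": "device A"},
        {"kind": "vendor", "data": "opaque"},
    ]
    emitted, skipped = replay_records(saved, HANDLERS)
    for line in emitted:
        print(line)
    print("skipped:", [r["kind"] for r in skipped])
```

Returning unhandled records instead of dropping them keeps the sketch honest
about one open design question: what should happen to records in flash that
no in-kernel consumer recognises.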
On 2023/10/7 15:15, Shuai Xue wrote:
> [...]

Hi, Borislav,

I would like to inquire about your satisfaction with the motivation provided.
If you have no objections, I am prepared to address Kees's comments, update
the cover letter, and proceed with sending a new version.

Thank you.

Best Regards,
Shuai
On Sat, Oct 07, 2023 at 03:15:45PM +0800, Shuai Xue wrote:
> So, IMHO, it's better to add a way to retrieve MCE records through switching
> to the new generation rasdaemon solution.

rasdaemon already collects errors and even saves them in a database of
sorts. No kernel changes needed.

> Sorry for the poor cover letter. I hope the following response can clarify
> the matter.
>
> Q1: What is the exact problem?
>
> Traditionally, fatal hardware errors will cause Linux print error log to
> console, e.g. print_mce() or __ghes_print_estatus(), then reboot. With
> Linux, the primary method for obtaining debugging information of a serious
> error or fault is via the kdump mechanism.

Not necessarily - see above.

> In the public cloud scenario, multiple virtual machines run on a
> single physical server, and if that server experiences a failure, it can
> potentially impact multiple tenants. It is crucial for us to thoroughly
> analyze the root causes of each instance failure in order to:
>
> - Provide customers with a detailed explanation of the outage to reassure them.
> - Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
> - Explore potential solutions to prevent widespread outages.

Huh, are you talking about providing customers with error information
from the *underlying* physical machine which runs the cloud VMs? That
sounds suspicious, to say the least.

AFAICT, all you can tell the VM owner is: yah, the hw had an
uncorrectable error in its memory and crashed. Is that the use case?

To be able to tell the VM owners why it crashed?

> In short, it is necessary to serialize hardware error information available
> for post-mortem debugging.
>
> Q2: What exactly I wanna do:
>
> The MCE handler, do_machine_check(), saves the MCE record to persistent
> storage and it is retrieved by mcelog. Mcelog has been deprecated when
> kernel 4.12 released in 2017, and the help of the configuration option
> CONFIG_X86_MCELOG_LEGACY suggest to consider switching to the new
> generation rasdaemon solution. The GHES handler does not support APEI error
> record now.

I think you're confusing things: MCEs do get reported to userspace
through the trace_mce_record tracepoint and rasdaemon opens it and reads
error info from there. And then writes it out to its db. So that works
now.

GHES is something different: it is a fw glue around error reporting so
that you don't have to develop a reporting driver for every platform but
you can use a single one - only the fw glue needs to be added.

The problem with GHES is that it is notoriously buggy and on x86 it
currently loads on a single platform only.

ARM are doing something in that area - you're better off talking to
James Morse about it. And he's on Cc.

> To serialize hardware error information available for post-mortem
> debugging:
> - add support to save APEI error record into flash via ERST before go panic,
> - add support to retrieve MCE or APEI error record from the flash and emit
> the related tracepoint after system boot successful again so that rasdaemon
> can collect them

Now that is yet another thing: you want to save error records into
firmware. First of all, you don't really need it if you do kdump as
explained above.

Then, that thing has its own troubles: it is buggy like every firmware
is and it can brick the machine.

I'm not saying it is not useful - there are some use cases for it which
are being worked on but if all you wanna do is dump MCEs to rasdaemon,
that works even now.

But then you have an ARM patch there and I'm confused because MCEs are
an x86 thing - ARM has different stuff.

So I think you need to elaborate more here.

Thx.
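
For the runtime path described above, where MCEs reach userspace through a
tracepoint, one quick check is whether that tracepoint is present and enabled
in tracefs. A minimal sketch, assuming tracefs is mounted at
/sys/kernel/tracing (on some systems it is under /sys/kernel/debug/tracing
instead) and that the event lives at events/mce/mce_record. tracepoint_state
is a hypothetical helper; rasdaemon does its own tracing setup, so this is
only a diagnostic aid, not a description of how rasdaemon works.

```python
import os


def tracepoint_state(system, event, tracefs_root="/sys/kernel/tracing"):
    """Return 'missing', 'unreadable', 'disabled' or 'enabled' for a tracefs event.

    'missing' also covers the case where the directory is not visible to the
    current user (tracefs is often readable by root only).
    """
    enable = os.path.join(tracefs_root, "events", system, event, "enable")
    if not os.path.exists(enable):
        return "missing"
    try:
        with open(enable) as f:
            value = f.read().strip()
    except OSError:
        return "unreadable"
    return "enabled" if value == "1" else "disabled"


if __name__ == "__main__":
    print("mce:mce_record is", tracepoint_state("mce", "mce_record"))
```

The tracefs_root parameter exists so the helper can be pointed at an
alternative mount point, or at a scratch directory when testing.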