Message ID | 20231220-cxl-cper-v5-0-1bb8a4ca2c7a@intel.com |
---|---|
Headers |
From: Ira Weiny <ira.weiny@intel.com>
Subject: [PATCH v5 0/9] efi/cxl-cper: Report CPER CXL component events through trace events
Date: Wed, 20 Dec 2023 16:17:27 -0800
Message-Id: <20231220-cxl-cper-v5-0-1bb8a4ca2c7a@intel.com>
To: Dan Williams <dan.j.williams@intel.com>, Jonathan Cameron <jonathan.cameron@huawei.com>, Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>, Shiju Jose <shiju.jose@huawei.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>, Davidlohr Bueso <dave@stgolabs.net>, Dave Jiang <dave.jiang@intel.com>, Alison Schofield <alison.schofield@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Ard Biesheuvel <ardb@kernel.org>, linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, Ira Weiny <ira.weiny@intel.com>, Jonathan Cameron <Jonathan.Cameron@Huawei.com>, "Rafael J. Wysocki" <rafael@kernel.org>, Bjorn Helgaas <bhelgaas@google.com>
X-Mailer: b4 0.13-dev-2539e |
Series |
efi/cxl-cper: Report CPER CXL component events through trace events
|
|
Message
Ira Weiny
Dec. 21, 2023, 12:17 a.m. UTC
Series status/background
========================
Smita has been a great help with this series. Thank you again!
Smita's testing found that the GHES code ended up printing the events
twice. This version avoids the duplicate print by calling the callback
from the GHES code instead of the EFI code as suggested by Dan.
Dependencies
============
NOTE this series still depends on Dan's addition of a device guard[1].
Therefore, the base commit is not a stable commit. I've pushed a branch
with this commit included for testing if folks are interested.[2]
[1] https://lore.kernel.org/all/170250854466.1522182.17555361077409628655.stgit@dwillia2-xfh.jf.intel.com/
[2] https://github.com/weiny2/linux-kernel/tree/cxl-cper-2023-12-20
Cover letter
============
CXL Component Events, as defined by UEFI 2.10 Section N.2.14, wrap a
mostly-CXL event payload in a Common Platform Error Record (CPER).
If a device is configured for firmware-first handling, CXL event records
are not sent directly to the host.
The CXL sub-system uniquely has the device physical address (DPA) to
host physical address (HPA) translation information. It also already
has event format tracing. Restructure the code to make sharing the data
between CPER/event logs most efficient. Then send the CXL CPER records
to the CXL sub-system for processing.
With event logs the events interrupt the driver directly. In the EFI
case events are wrapped with device information which allows the CXL
subsystem to identify the PCI device.
Previous versions considered matching the memdev differently. However,
the most robust approach was to find the PCI device via Bus, Device,
Function and use the PCI device to find the driver data.
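As a rough illustration of the Bus, Device, Function key described above, here is a small userspace sketch (not code from this series). `struct bdf_id`, `bdf_to_devfn()` and `bdf_format()` are hypothetical names made up for this example; the only assumptions carried over from PCI convention are the 5-bit device / 3-bit function packing and the `SSSS:BB:DD.F` text form seen in the trace output (`host=0000:10:00.0`).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the device identification wrapped around a CPER record. */
struct bdf_id {
	uint16_t segment;
	uint8_t bus;
	uint8_t device;		/* valid range 0..31 */
	uint8_t function;	/* valid range 0..7 */
};

/* Pack device and function the conventional PCI way: 5 bits device, 3 bits function. */
static unsigned int bdf_to_devfn(const struct bdf_id *id)
{
	return ((id->device & 0x1fu) << 3) | (id->function & 0x07u);
}

/* Format the identifier as segment:bus:device.function, all in hex. */
static int bdf_format(const struct bdf_id *id, char *buf, size_t len)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			(unsigned int)id->segment, (unsigned int)id->bus,
			(unsigned int)(id->device & 0x1f),
			(unsigned int)(id->function & 0x07));
}
```

In the kernel the lookup itself would presumably go through an existing helper such as pci_get_domain_bus_and_slot() using a devfn value packed like this; the sketch only shows how the key is formed.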
CPER records are identified with GUIDs while CXL event logs contain
UUIDs. The UUID is reported for all events no matter the source.
While the UUID is redundant for the known events, the UUIDs are already
used by rasdaemon. To keep compatibility, UUIDs are still reported.
In addition this series cleans up the UUID defines.
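To make the GUID/UUID relationship concrete, the following userspace sketch (an illustration, not part of the series) mimics the little-endian packing that a GUID_INIT()-style initializer applies to the first three fields, and shows that the General Media GUID prints as the same string that appears as the UUID in the trace output. The `guid_t` typedef, `DEMO_GUID_INIT()` and `guid_to_str()` here are local stand-ins written for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for a GUID: 16 raw bytes. */
typedef struct { uint8_t b[16]; } guid_t;

/*
 * Stand-in initializer: the first three fields are stored least
 * significant byte first, the remaining eight bytes are stored in order.
 */
#define DEMO_GUID_INIT(f1, f2, f3, d0, d1, d2, d3, d4, d5, d6, d7)	\
	((guid_t){{ (f1) & 0xff, ((f1) >> 8) & 0xff,			\
		    ((f1) >> 16) & 0xff, ((f1) >> 24) & 0xff,		\
		    (f2) & 0xff, ((f2) >> 8) & 0xff,			\
		    (f3) & 0xff, ((f3) >> 8) & 0xff,			\
		    (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})

/* Render the 8-4-4-4-12 text form, undoing the byte swap of the first three fields. */
static void guid_to_str(const guid_t *g, char out[37])
{
	snprintf(out, 37,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x%02x%02x%02x%02x",
		 g->b[3], g->b[2], g->b[1], g->b[0],
		 g->b[5], g->b[4], g->b[7], g->b[6],
		 g->b[8], g->b[9],
		 g->b[10], g->b[11], g->b[12], g->b[13], g->b[14], g->b[15]);
}
```

Feeding it the General Media values (0xfbcd0a77, 0xc260, 0x417f, 0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6) yields the string fbcd0a77-c260-417f-85a9-088b1621eba6, even though the first byte in memory is 0x77.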
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes in v5:
- Smita/djbw: trigger trace from ghes_do_proc()
- Jonathan: split out pci scoped based functions to its own patch
- Jonathan: remove unneeded static uuid variables
- Smita/djbw: trace an unknown event type as a generic with null UUID
- Jonathan: code clean ups
- Link to v4: https://lore.kernel.org/r/20231215-cxl-cper-v4-0-01b6dab44fcd@intel.com
---
Ira Weiny (9):
      cxl/trace: Pass uuid explicitly to event traces
      cxl/events: Promote CXL event structures to a core header
      cxl/events: Create common event UUID defines
      cxl/events: Remove passing a UUID to known event traces
      cxl/events: Separate UUID from event structures
      cxl/events: Create a CXL event union
      acpi/ghes: Process CXL Component Events
      PCI: Define scoped based management functions
      cxl/pci: Register for and process CPER events
drivers/acpi/apei/ghes.c | 88 +++++++++++++++++++++++
drivers/cxl/core/mbox.c | 87 +++++++++++------------
drivers/cxl/core/trace.h | 14 ++--
drivers/cxl/cxlmem.h | 110 +++++++----------------------
drivers/cxl/pci.c | 58 ++++++++++++++-
include/linux/cxl-event.h | 162 ++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 2 +
tools/testing/cxl/test/mem.c | 163 ++++++++++++++++++++++++-------------------
8 files changed, 476 insertions(+), 208 deletions(-)
---
base-commit: 6436863dfabce0d7ac416c8dc661fd513b967d39
change-id: 20230601-cxl-cper-26ffc839c6c6
Best regards,
Comments
On Wed, Dec 20, 2023 at 04:17:27PM -0800, Ira Weiny wrote:
>       cxl/trace: Pass uuid explicitly to event traces

Nit: s/uuid/UUID/ would match the patches below

>       cxl/events: Promote CXL event structures to a core header
>       cxl/events: Create common event UUID defines
>       cxl/events: Remove passing a UUID to known event traces
>       cxl/events: Separate UUID from event structures
>       cxl/events: Create a CXL event union
>       acpi/ghes: Process CXL Component Events
>       PCI: Define scoped based management functions

"scope based" unless I'm misunderstanding something. Maybe "cleanup and
guard functions"? "management" is pretty generic.
On Wed, 20 Dec 2023 16:17:27 -0800
Ira Weiny <ira.weiny@intel.com> wrote:

> Series status/background
> ========================
>
> Smita has been a great help with this series. Thank you again!
>
> Smita's testing found that the GHES code ended up printing the events
> twice. This version avoids the duplicate print by calling the callback
> from the GHES code instead of the EFI code as suggested by Dan.

I'm not sure this is working as intended.

There is nothing gating the call in ghes_proc() of ghes_print_estatus()
and now the EFI code handling that pretty printed things is missing we get
the horrible kernel logging for an unknown block instead.

So I think we need some minimal code in cper.c to match the guids then not
log them (on basis we are arguing there is no need for new cper records).
Otherwise we are in for some messy kernel logs

Something like:

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: event severity: recoverable
{1}[Hardware Error]: Error 0, type: recoverable
{1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
{1}[Hardware Error]: section length: 0x90
{1}[Hardware Error]: 00000000: 00000090 00000007 00000000 0d938086 ................
{1}[Hardware Error]: 00000010: 00100000 00000000 00040000 00000000 ................
{1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000040: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000050: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
{1}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''

(I'm filling the record with 0s currently)
On 1/8/2024 8:58 AM, Jonathan Cameron wrote:
> On Wed, 20 Dec 2023 16:17:27 -0800
> Ira Weiny <ira.weiny@intel.com> wrote:
>
> [...]
>
> I'm not sure this is working as intended.
>
> There is nothing gating the call in ghes_proc() of ghes_print_estatus()
> and now the EFI code handling that pretty printed things is missing we get
> the horrible kernel logging for an unknown block instead.
>
> So I think we need some minimal code in cper.c to match the guids then not
> log them (on basis we are arguing there is no need for new cper records).
> Otherwise we are in for some messy kernel logs
>
> Something like:
>
> {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> {1}[Hardware Error]: event severity: recoverable
> {1}[Hardware Error]: Error 0, type: recoverable
> {1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> {1}[Hardware Error]: section length: 0x90
> [...]
>
> (I'm filling the record with 0s currently)

Yeah, when I tested this, I thought it's okay for the hexdump to be there
in dmesg from EFI as the handling is done in trace events from GHES.

If we need to handle from EFI, then it would be a good reason to move
the GUIDs out from GHES and place them in a common location for EFI/cper
to share, similar to protocol errors.

Thanks,
Smita
Smita Koralahalli wrote:
> On 1/8/2024 8:58 AM, Jonathan Cameron wrote:
> > [...]
> > So I think we need some minimal code in cper.c to match the guids then not
> > log them (on basis we are arguing there is no need for new cper records).
> > Otherwise we are in for some messy kernel logs
> > [...]
>
> Yeah, when I tested this, I thought its okay for the hexdump to be there
> in dmesg from EFI as the handling is done in trace events from GHES.
>
> If, we need to handle from EFI, then it would be a good reason to move
> the GUIDs out from GHES and place it in a common location for EFI/cper
> to share similar to protocol errors.

Ah, yes, my expectation was more aligned with Jonathan's observation to
do the processing in GHES code *and* skip the processing in the CPER
code, something like:

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 56a5d2ef9e0a..e13e5fa4df4b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -666,30 +666,6 @@ static cxl_cper_callback cper_callback;
 
 /* CXL Event record UUIDs are formatted as GUIDs and reported in section type */
 
-/*
- * General Media Event Record
- * CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43
- */
-#define CPER_SEC_CXL_GEN_MEDIA_GUID					\
-	GUID_INIT(0xfbcd0a77, 0xc260, 0x417f,				\
-		  0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6)
-
-/*
- * DRAM Event Record
- * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
- */
-#define CPER_SEC_CXL_DRAM_GUID						\
-	GUID_INIT(0x601dcbb3, 0x9c06, 0x4eab,				\
-		  0xb8, 0xaf, 0x4e, 0x9b, 0xfb, 0x5c, 0x96, 0x24)
-
-/*
- * Memory Module Event Record
- * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
- */
-#define CPER_SEC_CXL_MEM_MODULE_GUID					\
-	GUID_INIT(0xfe927475, 0xdd59, 0x4339,				\
-		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74)
-
 static void cxl_cper_post_event(enum cxl_event_type event_type,
 				struct cxl_cper_event_rec *rec)
 {
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 35c37f667781..0a4eed470750 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -24,6 +24,7 @@
 #include <linux/bcd.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
+#include <linux/cxl-event.h>
 #include "cper_cxl.h"
 
 /*
@@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
 			cper_print_prot_err(newpfx, prot_err);
 		else
 			goto err_section_too_small;
+	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
+		printk("%ssection_type: CXL General Media Error\n", newpfx);
+		/* see: cxl_cper_event_call() */
+	} else if (guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID)) {
+		printk("%ssection_type: CXL DRAM Error\n", newpfx);
+		/* see: cxl_cper_event_call() */
+	} else if (guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		printk("%ssection_type: CXL Memory Module Error\n", newpfx);
+		/* see: cxl_cper_event_call() */
 	} else {
 		const void *err = acpi_hest_get_payload(gdata);
 
diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
index 17eadee819b6..6d9a7df88d4a 100644
--- a/include/linux/cxl-event.h
+++ b/include/linux/cxl-event.h
@@ -1,12 +1,31 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_CXL_EVENT_H
 #define _LINUX_CXL_EVENT_H
+#include <linux/uuid.h>
 
 /*
- * CXL event records; CXL rev 3.0
- *
- * Copyright(c) 2023 Intel Corporation.
+ * General Media Event Record
+ * CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43
+ */
+#define CPER_SEC_CXL_GEN_MEDIA_GUID					\
+	GUID_INIT(0xfbcd0a77, 0xc260, 0x417f,				\
+		  0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6)
+
+/*
+ * DRAM Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
+ */
+#define CPER_SEC_CXL_DRAM_GUID						\
+	GUID_INIT(0x601dcbb3, 0x9c06, 0x4eab,				\
+		  0xb8, 0xaf, 0x4e, 0x9b, 0xfb, 0x5c, 0x96, 0x24)
+
+/*
+ * Memory Module Event Record
+ * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45
  */
+#define CPER_SEC_CXL_MEM_MODULE_GUID					\
+	GUID_INIT(0xfe927475, 0xdd59, 0x4339,				\
+		  0xa5, 0x86, 0x79, 0xba, 0xb1, 0x13, 0xb7, 0x74)
 
 struct cxl_event_record_hdr {
 	u8 length;
Dan Williams wrote:
> Smita Koralahalli wrote:
> > [...]
> > If, we need to handle from EFI, then it would be a good reason to move
> > the GUIDs out from GHES and place it in a common location for EFI/cper
> > to share similar to protocol errors.
>
> Ah, yes, my expectation was more aligned with Jonathan's observation to
> do the processing in GHES code *and* skip the processing in the CPER
> code, something like:
>

Agreed, this was intended. I did not realize the above.

>
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 35c37f667781..0a4eed470750 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -24,6 +24,7 @@
>  #include <linux/bcd.h>
>  #include <acpi/ghes.h>
>  #include <ras/ras_event.h>
> +#include <linux/cxl-event.h>
>  #include "cper_cxl.h"
>
>  /*
> @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
>  			cper_print_prot_err(newpfx, prot_err);
>  		else
>  			goto err_section_too_small;
> +	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> +		printk("%ssection_type: CXL General Media Error\n", newpfx);

Do we want the printk's here? I did not realize that a generic event
would be printed. So the intention was that nothing would be done on
this path.

Ira
Ira Weiny wrote:
> Dan Williams wrote:
> > [...]
> > +	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > +		printk("%ssection_type: CXL General Media Error\n", newpfx);
>
> Do we want the printk's here? I did not realize that a generic event
> would be printed. So intention was nothing would be done on this path.

I think we do, otherwise the kernel will say:

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: event severity: recoverable
{1}[Hardware Error]: Error 0, type: recoverable
...

...leaving the user hanging vs:

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: event severity: recoverable
{1}[Hardware Error]: Error 0, type: recoverable
{1}[Hardware Error]: section type: General Media Error

...as an indicator to go follow up with rasdaemon or whatever else is
doing the detailed monitoring of CXL events.
On Mon, 8 Jan 2024 18:59:16 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Ira Weiny wrote:
> > Dan Williams wrote:
> > > Smita Koralahalli wrote:
> > > > On 1/8/2024 8:58 AM, Jonathan Cameron wrote:
> > > > > On Wed, 20 Dec 2023 16:17:27 -0800
> > > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > > >
> > > > >> Series status/background
> > > > >> ========================
> > > > >>
> > > > >> Smita has been a great help with this series. Thank you again!
> > > > >>
> > > > >> Smita's testing found that the GHES code ended up printing the events
> > > > >> twice. This version avoids the duplicate print by calling the callback
> > > > >> from the GHES code instead of the EFI code as suggested by Dan.
> > > > >
> > > > > I'm not sure this is working as intended.
> > > > >
> > > > > There is nothing gating the call in ghes_proc() of ghes_print_estatus()
> > > > > and now the EFI code handling that pretty printed things is missing we get
> > > > > the horrible kernel logging for an unknown block instead.
> > > > >
> > > > > So I think we need some minimal code in cper.c to match the guids then not
> > > > > log them (on basis we are arguing there is no need for new cper records).
> > > > > Otherwise we are in for some messy kernel logs
> > > > >
> > > > > Something like:
> > > > >
> > > > > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > > {1}[Hardware Error]: event severity: recoverable
> > > > > {1}[Hardware Error]: Error 0, type: recoverable
> > > > > {1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> > > > > {1}[Hardware Error]: section length: 0x90
> > > > > {1}[Hardware Error]: 00000000: 00000090 00000007 00000000 0d938086 ................
> > > > > {1}[Hardware Error]: 00000010: 00100000 00000000 00040000 00000000 ................
> > > > > {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000040: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000050: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
> > > > > {1}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
> > > > > cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''
> > > > >
> > > > > (I'm filling the record with 0s currently)
> > > >
> > > > Yeah, when I tested this, I thought its okay for the hexdump to be there
> > > > in dmesg from EFI as the handling is done in trace events from GHES.
> > > >
> > > > If, we need to handle from EFI, then it would be a good reason to move
> > > > the GUIDs out from GHES and place it in a common location for EFI/cper
> > > > to share similar to protocol errors.
> > >
> > > Ah, yes, my expectation was more aligned with Jonathan's observation to
> > > do the processing in GHES code *and* skip the processing in the CPER
> > > code, something like:
> > >
> >
> > Agreed this was intended I did not realize the above.
> >
> > >
> > >
> > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> > > index 35c37f667781..0a4eed470750 100644
> > > --- a/drivers/firmware/efi/cper.c
> > > +++ b/drivers/firmware/efi/cper.c
> > > @@ -24,6 +24,7 @@
> > >  #include <linux/bcd.h>
> > >  #include <acpi/ghes.h>
> > >  #include <ras/ras_event.h>
> > > +#include <linux/cxl-event.h>
> > >  #include "cper_cxl.h"
> > >
> > >  /*
> > > @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
> > >              cper_print_prot_err(newpfx, prot_err);
> > >          else
> > >              goto err_section_too_small;
> > > +    } else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > > +        printk("%ssection_type: CXL General Media Error\n", newpfx);
> > >
> > Do we want the printk's here? I did not realize that a generic event
> > would be printed. So intention was nothing would be done on this path.
>
> I think we do otherwise the kernel will say
>
> {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> {1}[Hardware Error]: event severity: recoverable
> {1}[Hardware Error]: Error 0, type: recoverable
> ...
>
> ...leaving the user hanging vs:
>
> {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> {1}[Hardware Error]: event severity: recoverable
> {1}[Hardware Error]: Error 0, type: recoverable
> {1}[Hardware Error]: section type: General Media Error
>
> ...as an indicator to go follow up with rasdaemon or whatever else is
> doing the detailed monitoring of CXL events.

Agreed. Maybe push it out to a static const table though.
As the argument was that we shouldn't be spitting out big logs in this
modern world, let's make it easy for people to add more entries.

struct skip_me {
    const guid_t *guid;
    const char *name;
};
static const struct skip_me skip_me[] = {
    { &CPER_SEC_CXL_GEN_MEDIA_GUID, "CXL General Media Error" },
    /* etc. */
};

for (i = 0; i < ARRAY_SIZE(skip_me); i++) {
    if (guid_equal(sec_type, skip_me[i].guid)) {
        printk("%ssection_type: %s\n", newpfx, skip_me[i].name);
        break;
    }
}

or something like that in the final else.
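
The table-driven lookup suggested above can be tried outside the kernel. What follows is a minimal user-space sketch in C, not the code that was eventually merged. `guid_t`, `guid_equal()` and `ARRAY_SIZE()` are local stand-ins for the kernel helpers of the same names, and `cxl_gen_media_guid`, `known_sections`, `known_section_name()` and `format_section_type()` are hypothetical names chosen for illustration. The GUID value is the one shown in the log output quoted earlier in the thread. The byte layout assumes the usual GUID convention of little-endian leading fields.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* User-space stand-in for the kernel's guid_t: 16 raw bytes. */
typedef struct { unsigned char b[16]; } guid_t;

/* Stand-in for the kernel's guid_equal(): plain byte comparison. */
static int guid_equal(const guid_t *a, const guid_t *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * fbcd0a77-c260-417f-85a9-088b1621eba6, with the first three fields
 * stored little-endian (assumed GUID byte order).
 */
static const guid_t cxl_gen_media_guid = {{
    0x77, 0x0a, 0xcd, 0xfb, 0x60, 0xc2, 0x7f, 0x41,
    0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6,
}};

/* One entry per section type that gets a one-line summary only. */
struct known_section {
    const guid_t *guid;
    const char *name;
};

static const struct known_section known_sections[] = {
    { &cxl_gen_media_guid, "CXL General Media Error" },
    /* further section types would be added here */
};

/* Return the display name for a known section type, or NULL. */
static const char *known_section_name(const guid_t *sec_type)
{
    size_t i;

    for (i = 0; i < ARRAY_SIZE(known_sections); i++) {
        if (guid_equal(sec_type, known_sections[i].guid))
            return known_sections[i].name;
    }
    return NULL;
}

/*
 * Write the one-line summary into buf and return 1 for a known section.
 * Return 0 for an unknown one, so the caller can fall back to its
 * generic handling (the hexdump in the discussion above).
 */
static int format_section_type(char *buf, size_t len, const char *pfx,
                               const guid_t *sec_type)
{
    const char *name = known_section_name(sec_type);

    if (!name)
        return 0;
    snprintf(buf, len, "%ssection_type: %s\n", pfx, name);
    return 1;
}
```

Storing a pointer to each GUID in the table lets the entry be handed straight to `guid_equal()`, and adding another section type is a one-line change to `known_sections`.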
Jonathan Cameron wrote:
> On Mon, 8 Jan 2024 18:59:16 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> Agreed. Maybe push it out to a static const table though.
> As the argument was that we shouldn't be spitting out big logs in this
> modern world, let's make it easy for people to add more entries.
[..]
> or something like that in the final else.

I like it.

Any concerns with that being an -rc fixup, and move ahead with the base
enabling for v6.8? I don't see that follow-on as a reason to push the
whole thing to v6.9.
Dan Williams wrote:
[..]
> I like it.
>
> Any concerns with that being an -rc fixup, and move ahead with the base
> enabling for v6.8? I don't see that follow-on as a reason to push the
> whole thing to v6.9.

I will put it in -next for soak time and make an inclusion decision in a
few days after I hear back.
On Wed, 10 Jan 2024 at 00:30, Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> I will put it in -next for soak time and make an inclusion decision in a
> few days after I hear back.
>

For the series and however you want to handle the merge:

Acked-by: Ard Biesheuvel <ardb@kernel.org>
On Wed, 10 Jan 2024 00:31:17 +0100
Ard Biesheuvel <ardb@kernel.org> wrote:
[..]
> For the series and however you want to handle the merge:
>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Any path in works for me as well.

J