[RFC,1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

Message ID 20221206153354.92394-2-xueshuai@linux.alibaba.com
State New

Commit Message

Shuai Xue Dec. 6, 2022, 3:33 p.m. UTC
There are two major types of uncorrected error (UC):

- Action Required: The error is detected after the processor has already
  consumed the corrupted memory. The OS must take action (for example,
  offline the failing page or kill the failing thread) to recover from this
  uncorrectable error.

- Action Optional: The error is detected outside of the processor execution
  context. Some data in memory is corrupted but has not been consumed. The
  OS may optionally take action to recover from this uncorrectable error.

On x86 platforms, the two types are easily distinguished based on the MCA
bank. On arm64 platforms, however, the memory failure flags for all UCs
whose severity is GHES_SEV_RECOVERABLE are currently set to 0, a.k.a.
Action Optional. Set the memory failure flags to MF_ACTION_REQUIRED on
synchronous events.
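
The effect of the change on flag selection can be modeled in user space.
The following is a minimal sketch, not the kernel code: pick_mf_flags() is a
hypothetical helper, and the constant values are stand-ins assumed to mirror
the kernel headers and this patch (check them against the tree you build).
It returns -1 when no memory-failure handling is requested.

```c
#include <stdint.h>

/* Stand-in values; assumed to mirror the kernel headers and this patch. */
enum { GHES_SEV_NO, GHES_SEV_CORRECTED, GHES_SEV_RECOVERABLE, GHES_SEV_PANIC };

#define CPER_SEC_ERROR_THRESHOLD_EXCEEDED	0x0008
#define CPER_SEC_SYNC				0x0100	/* added by this patch */

#define MF_ACTION_REQUIRED	(1 << 1)
#define MF_SOFT_OFFLINE		(1 << 3)

/*
 * Hypothetical helper modeling the flag choice made in
 * ghes_handle_memory_failure(). Returns -1 if nothing should be done.
 */
static int pick_mf_flags(int sev, int sec_sev, uint16_t section_flags)
{
	int flags = -1;

	/*
	 * Corrected errors over threshold: soft-offline the page.
	 * The severity half of this test is an assumption; the hunk in the
	 * patch only shows the threshold check.
	 */
	if (sec_sev == GHES_SEV_CORRECTED &&
	    (section_flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
		flags = MF_SOFT_OFFLINE;

	/* Recoverable UC: synchronous means the CPU consumed the poison. */
	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
		flags = (section_flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;

	return flags;
}
```

For example, pick_mf_flags(GHES_SEV_RECOVERABLE, GHES_SEV_RECOVERABLE,
CPER_SEC_SYNC) yields MF_ACTION_REQUIRED, while the same call with section
flags of 0 yields 0 (Action Optional), which was the result for every
recoverable UC before this change.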

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c |  2 +-
 include/linux/cper.h     | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)
  

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9952f3a792ba..a420759fce2d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -475,7 +475,7 @@  static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+		flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..a3571fa8a73d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -144,6 +144,28 @@  enum {
  * corrective action before the data is consumed
  */
 #define CPER_SEC_LATENT_ERROR			0x0020
+/*
+ * If set, the section is to be associated with an error that has been
+ * propagated due to hardware poisoning. This implies the error is a symptom of
+ * another error. It is not always possible to ascertain whether this is the
+ * case for an error, therefore if the flag is not set, it is unknown whether
+ * the error was propagated. This helps determine the FRU when dealing with HW
+ * failures.
+ */
+#define CPER_SEC_PROPAGATED                    0x0040
+/*
+ * If set, this flag indicates the firmware has detected an overflow of
+ * buffers/queues that are used to accumulate, collect, or report errors (e.g.
+ * the error status control block exposed to the OS). When this occurs, some
+ * error records may be lost.
+ */
+#define CPER_SEC_OVERFLOW                      0x0080
+/*
+ * If set, it indicates that this event record is synchronous (e.g. a CPU core
+ * consumes poisoned data and then takes an instruction/data abort); if not
+ * set, this event record is asynchronous.
+ */
+#define CPER_SEC_SYNC                          0x0100
 
 /*
  * Section type definitions, used in section_type field in struct