EDAC/amd64: Add MI300 row retirement support

Message ID 20240204155106.3110934-1-yazen.ghannam@amd.com
State New
Series EDAC/amd64: Add MI300 row retirement support

Commit Message

Yazen Ghannam Feb. 4, 2024, 3:51 p.m. UTC
  AMD MI300 systems have on-die High Bandwidth Memory. This memory has a
relatively higher error rate, and it is not individually replaceable
like DIMMs.

Uncorrectable ECC errors are individually reported as Deferred errors
using the AMD Deferred error interrupt. Each reported error corresponds
to a single hardware error.

Correctable ECC errors may be reported in batches through MCA Thresholding.
Users can configure the threshold limit based on their policy. Each
reported Correctable error represents a single occurrence of the
threshold limit being reached.

The current guidance from AMD designers is that memory affected by ECC
errors within a DRAM row should be retired. Action should be taken on
every reported ECC error.

Add a helper function to apply this policy for MI300 systems.

This and similar functionality may be best handled in a separate,
generic module. In the meantime, do this in AMD64 EDAC for simplicity.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Notes:

This is a complete rewrite of the following patch:
https://lore.kernel.org/r/20231129073521.2127403-7-muralimk@amd.com

I'd like to include Murali as co-developer, since this is based on his
work.

The remaining MI300 RAS work will be focused on saving and restoring bad
memory information across reboots. The latest set on the mailing list is
here:
https://lore.kernel.org/r/20231129075034.2159223-1-muralimk@amd.com

 drivers/edac/Kconfig      |  1 +
 drivers/edac/amd64_edac.c | 48 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)
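
As a rough illustration of the thresholding behaviour described in the commit message, where one reported Correctable error stands for one occurrence of the limit being reached, the following is a minimal userspace sketch. It is a simplified software model only, not the MCA hardware counter or any kernel interface; `struct ce_threshold` and `ce_threshold_record()` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a correctable-error threshold counter. */
struct ce_threshold {
	uint32_t limit;	/* user-configured threshold limit, must be > 0 */
	uint32_t count;	/* correctable errors seen since the last report */
};

/*
 * Record one correctable error. Returns true when the limit has been
 * reached, i.e. when a single "reported" error would be raised for the
 * whole batch, and restarts the count for the next batch.
 */
static bool ce_threshold_record(struct ce_threshold *t)
{
	assert(t->limit > 0);

	if (++t->count < t->limit)
		return false;

	t->count = 0;
	return true;
}
```

With a limit of 3, seven individual errors would produce two reports in this model, and under the stated guidance each of those reports would trigger row retirement.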
  

Comments

M K, Muralidhara Feb. 5, 2024, 7:10 a.m. UTC | #1
On 2/4/2024 9:21 PM, Yazen Ghannam wrote:
> https://lore.kernel.org/r/20231129073521.2127403-7-muralimk@amd.com
> 
> I'd like to include Murali as co-developer, since this is based on his
> work.

Thanks for the flow. I have tested it, and the row retirement feature is working
as expected. Please include:
Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
  
Borislav Petkov Feb. 5, 2024, 1:05 p.m. UTC | #2
On Sun, Feb 04, 2024 at 09:51:06AM -0600, Yazen Ghannam wrote:
> AMD MI300 systems have on-die High Bandwidth Memory. This memory has a
> relatively higher error rate, and it is not individually replaceable
> like DIMMs.

..

>  drivers/edac/Kconfig      |  1 +
>  drivers/edac/amd64_edac.c | 48 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)

Applied, thanks.
  

Patch

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 16c8de5050e5..8b147403c955 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -78,6 +78,7 @@  config EDAC_GHES
 config EDAC_AMD64
 	tristate "AMD64 (Opteron, Athlon64)"
 	depends on AMD_NB && EDAC_DECODE_MCE
+	depends on MEMORY_FAILURE
 	imply AMD_ATL
 	help
 	  Support for error detection and correction of DRAM ECC errors on
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ca9a8641652d..ee2f3ff15ab7 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2795,6 +2795,51 @@  static void umc_get_err_info(struct mce *m, struct err_info *err)
 	err->csrow = m->synd & 0x7;
 }
 
+/*
+ * When a DRAM ECC error occurs on MI300 systems, it is recommended to retire
+ * all memory within that DRAM row. This applies to the memory within a DRAM
+ * bank.
+ *
+ * To find the memory addresses, loop through permutations of the DRAM column
+ * bits and find the System Physical address of each. The column bits are used
+ * to calculate the intermediate Normalized address, so all permutations should
+ * be checked.
+ *
+ * See amd_atl::convert_dram_to_norm_addr_mi300() for MI300 address formats.
+ */
+#define MI300_UMC_MCA_COL	GENMASK(5, 1)
+#define MI300_NUM_COL		BIT(HWEIGHT(MI300_UMC_MCA_COL))
+static void retire_row_mi300(struct atl_err *a_err)
+{
+	unsigned long addr;
+	struct page *p;
+	u8 col;
+
+	for (col = 0; col < MI300_NUM_COL; col++) {
+		a_err->addr &= ~MI300_UMC_MCA_COL;
+		a_err->addr |= FIELD_PREP(MI300_UMC_MCA_COL, col);
+
+		addr = amd_convert_umc_mca_addr_to_sys_addr(a_err);
+		if (IS_ERR_VALUE(addr))
+			continue;
+
+		addr = PHYS_PFN(addr);
+
+		/*
+		 * Skip invalid or already poisoned pages to avoid unnecessary
+		 * error messages from memory_failure().
+		 */
+		p = pfn_to_online_page(addr);
+		if (!p)
+			continue;
+
+		if (PageHWPoison(p))
+			continue;
+
+		memory_failure(addr, 0);
+	}
+}
+
 static void decode_umc_error(int node_id, struct mce *m)
 {
 	u8 ecc_type = (m->status >> 45) & 0x3;
@@ -2845,6 +2890,9 @@  static void decode_umc_error(int node_id, struct mce *m)
 
 	error_address_to_page_and_offset(sys_addr, &err);
 
+	if (pvt->fam == 0x19 && pvt->dram_type == MEM_HBM3)
+		retire_row_mi300(&a_err);
+
 log_error:
 	__log_ecc_error(mci, &err, ecc_type);
 }
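
The column loop in retire_row_mi300() above can be tried outside the kernel with a small userspace sketch. It assumes only what the patch states: the column field occupies bits [5:1] of the MCA address, giving 32 permutations. The macros and `row_addresses()` below are hypothetical stand-ins for GENMASK(), FIELD_PREP() and the loop body; the translation to a system physical address and the call to memory_failure() are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel helpers used in the patch. */
#define COL_SHIFT	1u
#define COL_BITS	5u
#define COL_MASK	(((UINT64_C(1) << COL_BITS) - 1) << COL_SHIFT)	/* bits [5:1] */
#define NUM_COL		(1u << COL_BITS)				/* 32 columns */

/* Replace the column field of addr with col; all other bits are kept. */
static uint64_t set_col(uint64_t addr, unsigned int col)
{
	assert(col < NUM_COL);

	return (addr & ~COL_MASK) | ((uint64_t)col << COL_SHIFT);
}

/*
 * Fill out[] with every address that differs from addr only in the
 * column bits. In the patch, each of these would then be translated
 * to a system physical address and its page retired.
 */
static void row_addresses(uint64_t addr, uint64_t out[NUM_COL])
{
	unsigned int col;

	for (col = 0; col < NUM_COL; col++)
		out[col] = set_col(addr, col);
}
```

For an arbitrary example address of 0x12345678, the generated values run from 0x12345640 (column 0) to 0x1234567e (column 31) in steps of 2, because bit 0 is not part of the column field.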