[2/2] RAS: Introduce the FRU Memory Poison Manager

Message ID 20240214033516.1344948-3-yazen.ghannam@amd.com
State New
Headers
Series FRU Memory Poison Manager |

Commit Message

Yazen Ghannam Feb. 14, 2024, 3:35 a.m. UTC
  Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations is acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in the kernel memory manager, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time
error handling may occur in the next reboot cycle. Running jobs may be
terminated due to previously known bad memory. This could be prevented
if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver may save memory error information to the
persistent storage during run time. The information may then be restored
after reset, and known bad memory may be retired before use. A running
log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should
leverage industry-standard components, as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.

A UEFI CPER "FRU Memory Poison Section" is being proposed, along with a
"Memory Poison Descriptor", to use for this purpose. These new structures
are minimal, saving space on limited non-volatile memory, and extensible.

CPER-aware persistent storage interfaces, like ACPI ERST and EFI Runtime
Variables, can be used. A new interface is not required.

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 MAINTAINERS            |   7 +
 drivers/ras/Kconfig    |  13 +
 drivers/ras/Makefile   |   1 +
 drivers/ras/amd/fmpm.c | 776 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 797 insertions(+)
 create mode 100644 drivers/ras/amd/fmpm.c
  

Comments

Borislav Petkov Feb. 14, 2024, 9:06 a.m. UTC | #1
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> Memory errors are an expected occurrence on systems with high memory
> density. Generally, errors within a small number of unique physical
> locations is acceptable, based on manufacturer and/or admin policy.
> During run time, memory with errors may be retired so it is no longer
> used by the system. This is done in the kernel memory manager, and the
> effect will remain until the system is restarted.
> 
> If a memory location is consistently faulty, then the same run time
> error handling may occur in the next reboot cycle. Running jobs may be
> terminated due to previously known bad memory. This could be prevented
> if information from the previous boot was not lost.
> 
> Some add-in cards with driver-managed memory have on-board persistent
> storage. Their driver may save memory error information to the
> persistent storage during run time. The information may then be restored
> after reset, and known bad memory may be retired before use. A running
> log of bad memory locations is kept across multiple resets.

Too many "may"s above, please tone them down.

> A similar solution is desirable for CPUs. However, this solution should

GPUs you mean?

> leverage industry-standard components, as much as possible, rather than
> a bespoke platform driver.
> 
> Two components are needed: a record format and a persistent storage
> interface.
> 
> A UEFI CPER "FRU Memory Poison Section" is being proposed, along with a
> "Memory Poison Descriptor", to use for this purpose. These new structures
> are minimal, saving space on limited non-volatile memory, and extensible.
> 
> CPER-aware persistent storage interfaces, like ACPI ERST and EFI Runtime
> Variables, can be used. A new interface is not required.

I don't think stuff which is being proposed belongs here.

> Implement a new module to manage the record formats on persistent
> storage. Use the requirements for an AMD MI300-based system to start.
> Vendor- and platform-specific details can be abstracted later as needed.

This is a big diff so I'm splitting mails.

Thx.
  
Borislav Petkov Feb. 14, 2024, 9:28 a.m. UTC | #2
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +config RAS_FMPM
> +	tristate "FRU Memory Poison Manager"
> +	default m
> +	depends on X86_MCE

I know this is generic-ish but it needs to be enabled only on AMD for
now. Whoever wants it somewhere else, then whoever needs to test it
there first and then enable it there.

> +	imply AMD_ATL
> +	help
> +	  Support saving and restoring memory error information across reboot
> +	  cycles using ACPI ERST as persistent storage. Error information is

s/cycles//

> +	  saved with the UEFI CPER "FRU Memory Poison" section format.
> +
> +	  Memory may be retired during boot time and run time depending on

s/may/is/

Please check all your text - too many "may"s for something which is not
a vendor doc. :)

> +	  platform-specific policies.
> +
>  endif
> diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
> index 3fac80f58005..11f95d59d397 100644
> --- a/drivers/ras/Makefile
> +++ b/drivers/ras/Makefile
> @@ -3,4 +3,5 @@ obj-$(CONFIG_RAS)	+= ras.o
>  obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
>  obj-$(CONFIG_RAS_CEC)	+= cec.o
>  
> +obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
>  obj-y			+= amd/atl/
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> new file mode 100644
> index 000000000000..077d9f35cc7d
> --- /dev/null
> +++ b/drivers/ras/amd/fmpm.c
> @@ -0,0 +1,776 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * FRU (Field-Replaceable Unit) Memory Poison Manager
> + *
> + * Copyright (c) 2024, Advanced Micro Devices, Inc.
> + * All Rights Reserved.
> + *
> + * Authors:
> + *	Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
> + *	Muralidhara M K <muralidhara.mk@amd.com>
> + *	Yazen Ghannam <Yazen.Ghannam@amd.com>
> + *
> + * Implementation notes, assumptions, and limitations:
> + *
> + * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
> + *   included in the UEFI specification. So they are defined here. Afterwards, they
> + *   may be moved to linux/cper.h, if appropriate.
> + *
> + * - Platforms based on AMD MI300 systems will be the first to use these structures.
> + *   There are a number of assumptions made here that will need to be generalized
> + *   to support other platforms.
> + *
> + *   AMD MI300-based platform(s) assumptions:
> + *   - Memory errors are reported through x86 MCA.
> + *   - The entire DRAM row containing a memory error should be retired.
> + *   - There will be (1) FRU Memory Poison Section per CPER.
> + *   - The FRU will be the CPU Package (Processor Socket).
> + *   - The default number of Memory Poison Descriptor entries should be (8).
> + *   - The Platform will use ACPI ERST for persistent storage.
> + *   - All FRU records should be saved to persistent storage. Module init will
> + *     fail if any FRU record is not successfully written.

Please drop all that capitalized spelling.

> + * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
> + *   to use this module for other vendors.

This is not needed.

> + * - Boot time memory retirement may occur later than ideal due to dependencies
> + *   on other libraries and drivers. This leaves a gap where bad memory may be
> + *   accessed during early boot stages.
> + *
> + * - Enough memory should be pre-allocated for each FRU record to be able to hold
> + *   the expected number of descriptor entries. This, mostly empty, record is
> + *   written to storage during init time. Subsequent writes to the same record
> + *   should allow the Platform to update the stored record in-place. Otherwise,
> + *   if the record is extended, then the Platform may need to perform costly memory
> + *   management operations on the storage. For example, the Platform may spend time
> + *   in Firmware copying and invalidating memory on a relatively slow SPI ROM.

That's a good thing to have here.

> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/cper.h>
> +#include <linux/ras.h>
> +
> +#include <acpi/apei.h>
> +
> +#include <asm/cpu_device_id.h>
> +#include <asm/mce.h>
> +
> +#pragma pack(1)

Is that some ugly thing to avoid adding __packed annotation to the
structure definitions below?

"GCC supports several types of pragmas, primarily in order to compile
code originally written for other compilers. Note that in general we do
not recommend the use of pragmas; See Declaring Attributes of Functions,
for further explanation. "

Oh, that 1 is something else:

-fpack-struct[=n]

    Without a value specified, pack all structure members together
    without holes. When a value is specified (which must be a small
    power of two), pack structure members according to this value,
    representing the maximum alignment (that is, objects with default
    alignment requirements larger than this are output potentially
    unaligned at the next fitting location.

So do I understand it correctly that struct members should be aligned to
2^1 bytes?

Grepping the tree, this looks like something BIOS does...
  
Borislav Petkov Feb. 14, 2024, 10:29 a.m. UTC | #3
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +/* FRU Memory Poison Section, UEFI vX.Y sec N.X.Z */

Whack those:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 328e0a962c23..0246b13b5ba1 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -72,7 +72,7 @@
 /* FRU ID Types */
 #define FMP_ID_TYPE_X86_PPIN		0
 
-/* FRU Memory Poison Section, UEFI vX.Y sec N.X.Z */
+/* FRU Memory Poison Section */
 struct cper_sec_fru_mem_poison {
 	u32 checksum;
 	u64 validation_bits;
@@ -89,7 +89,7 @@ struct cper_sec_fru_mem_poison {
 /* FRU Descriptor Address Types */
 #define FPD_ADDR_TYPE_MCA_ADDR		0
 
-/* Memory Poison Descriptor, UEFI vX.Y sec N.X.Y */
+/* Memory Poison Descriptor */
 struct cper_fru_poison_desc {
 	u64 timestamp;
 	u32 hw_id_type;


> +/**
> + * DOC: fru_poison_entries (byte)
> + * Maximum number of descriptor entries possible for each FRU.
> + *
> + * Values between '1' and '255' are valid.
> + * No input or '0' will default to FMPM_DEFAULT_MAX_NR_ENTRIES.
> + */
> +static u8 max_nr_entries;
> +module_param(max_nr_entries, byte, 0644);
> +MODULE_PARM_DESC(max_nr_entries,
> +		 "Maximum number of memory poison descriptor entries per FRU");

Why is there a module parameter?

So that people can brick their BIOSes if it can't handle some size?

Can we read out the max size of the area destined for FRU records from
somewhere and go with it?

> +#define FMPM_DEFAULT_MAX_NR_ENTRIES	8
> +
> +/* Maximum number of FRUs in the system. */
> +static unsigned int max_nr_fru;
> +
> +/* Total length of record including headers and list of descriptor entries. */
> +static size_t max_rec_len;
> +
> +/*
> + * Protect the local cache and prevent concurrent writes to storage.

"local cache"?

> + * This is only needed after init once notifier block registration is done.
> + */
> +static DEFINE_MUTEX(fmpm_update_mutex);
> +
> +#define for_each_fru(i, rec) \
> +	for (i = 0; rec = fru_records[i], i < max_nr_fru; i++)
  
Borislav Petkov Feb. 14, 2024, 10:34 a.m. UTC | #4
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +#include <asm/mce.h>

..

> +static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
								^^^^^^


This is a generic thing and thus can't use an x86-ism struct mce,
remember?
  
Borislav Petkov Feb. 14, 2024, 10:51 a.m. UTC | #5
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static inline struct cper_sec_fru_mem_poison *get_fmp(struct fru_rec *rec)
> +{
> +	return &rec->fmp;
> +}

Let's get rid of those silly helpers, diff for this one below.

The logic is, you pass around struct fru_rec *rec and inside the
functions you deref what you need.

Thx.

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 0246b13b5ba1..9eaf892e35b9 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -153,11 +153,6 @@ static DEFINE_MUTEX(fmpm_update_mutex);
 #define for_each_fru(i, rec) \
 	for (i = 0; rec = fru_records[i], i < max_nr_fru; i++)
 
-static inline struct cper_sec_fru_mem_poison *get_fmp(struct fru_rec *rec)
-{
-	return &rec->fmp;
-}
-
 static inline struct cper_fru_poison_desc *get_fpd(struct fru_rec *rec, u32 entry)
 {
 	return &rec->entries[entry];
@@ -174,7 +169,7 @@ static struct fru_rec *get_fru_record(u64 fru_id)
 	unsigned int i;
 
 	for_each_fru(i, rec) {
-		if (get_fmp(rec)->fru_id == fru_id)
+		if (rec->fmp.fru_id == fru_id)
 			return rec;
 	}
 
@@ -203,16 +198,15 @@ static u32 do_fmp_checksum(struct cper_sec_fru_mem_poison *fmp, u32 len)
 /* Calculate a new checksum. */
 static u32 get_fmp_checksum(struct fru_rec *rec)
 {
-	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
 	u32 len, checksum;
 
 	len = get_fmp_len(rec);
 
 	/* Get the current total. */
-	checksum = do_fmp_checksum(fmp, len);
+	checksum = do_fmp_checksum(&rec->fmp, len);
 
 	/* Subtract the current checksum from total. */
-	checksum -= fmp->checksum;
+	checksum -= rec->fmp.checksum;
 
 	/* Return the compliment value. */
 	return 0 - checksum;
@@ -244,12 +238,12 @@ static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
 	fpd->addr	= m->addr;
 }
 
-static bool has_valid_entries(u64 valid_bits)
+static bool has_valid_entries(struct fru_rec *rec)
 {
-	if (!(valid_bits & FMP_VALID_LIST_ENTRIES))
+	if (!(rec->fmp.validation_bits & FMP_VALID_LIST_ENTRIES))
 		return false;
 
-	if (!(valid_bits & FMP_VALID_LIST))
+	if (!(rec->fmp.validation_bits & FMP_VALID_LIST))
 		return false;
 
 	return true;
@@ -282,7 +276,7 @@ static bool is_dup_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *new)
 {
 	unsigned int i;
 
-	for (i = 0; i < get_fmp(rec)->nr_entries; i++) {
+	for (i = 0; i < rec->fmp.nr_entries; i++) {
 		if (same_fpd(get_fpd(rec, i), new)) {
 			pr_debug("Found duplicate record");
 			return true;
@@ -294,7 +288,7 @@ static bool is_dup_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *new)
 
 static void update_fru_record(struct fru_rec *rec, struct mce *m)
 {
-	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
+	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
 	struct cper_fru_poison_desc fpd;
 	u32 entry = 0;
 
@@ -303,15 +297,15 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 	init_fpd(&fpd, m);
 
 	/* This is the first entry, so just save it. */
-	if (!has_valid_entries(fmp->validation_bits))
+	if (!has_valid_entries(rec))
 		goto save_fpd;
 
 	/* Ignore already recorded errors. */
 	if (is_dup_fpd(rec, &fpd))
 		goto out_unlock;
 
-	if (fmp->nr_entries >= max_nr_entries) {
-		pr_warn("Exceeded number of entries for FRU 0x%016llx", fmp->fru_id);
+	if (rec->fmp.nr_entries >= max_nr_entries) {
+		pr_warn("Exceeded number of entries for FRU 0x%016llx", rec->fmp.fru_id);
 		goto out_unlock;
 	}
 
@@ -412,9 +406,9 @@ static void retire_mem_records(void)
 	u32 cpu;
 
 	for_each_fru(i, rec) {
-		fmp = get_fmp(rec);
+		fmp = &rec->fmp;
 
-		if (!has_valid_entries(fmp->validation_bits))
+		if (!has_valid_entries(rec))
 			continue;
 
 		cpu = get_cpu_from_fru_id(fmp->fru_id);
@@ -481,7 +475,7 @@ static int save_new_records(void)
 
 static bool is_valid_fmp(struct fru_rec *rec)
 {
-	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
+	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
 	u32 len = get_fmp_len(rec);
 
 	if (!fmp)
@@ -602,8 +596,10 @@ static int get_saved_records(void)
 	return ret;
 }
 
-static void set_fmp_fields(struct cper_sec_fru_mem_poison *fmp, unsigned int cpu)
+static void set_fmp_fields(struct fru_rec *rec, unsigned int cpu)
 {
+	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
+
 	fmp->fru_arch_type    = FMP_ARCH_TYPE_X86_CPUID_1_EAX;
 	fmp->validation_bits |= FMP_VALID_ARCH_TYPE;
 
@@ -638,7 +634,7 @@ static void init_fmps(void)
 
 	for_each_fru(i, rec) {
 		cpu = get_cpu_for_fru_num(i);
-		set_fmp_fields(get_fmp(rec), cpu);
+		set_fmp_fields(rec, cpu);
 	}
 }
  
Borislav Petkov Feb. 14, 2024, 11:05 a.m. UTC | #6
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static inline struct cper_fru_poison_desc *get_fpd(struct fru_rec *rec, u32 entry)
> +{
> +	return &rec->entries[entry];
> +}

This one needs to go too.

> +static inline u32 get_fmp_len(struct fru_rec *rec)
> +{
> +	return rec->sec_desc.section_length - sizeof(struct cper_section_descriptor);
> +}

Oh well, I guess we can keep that one.

> +/* Calculate a new checksum. */
> +static u32 get_fmp_checksum(struct fru_rec *rec)
> +{
> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
> +	u32 len, checksum;
> +
> +	len = get_fmp_len(rec);
> +
> +	/* Get the current total. */
> +	checksum = do_fmp_checksum(fmp, len);
> +
> +	/* Subtract the current checksum from total. */
> +	checksum -= fmp->checksum;
> +
> +	/* Return the compliment value. */
> +	return 0 - checksum;
> +}

Let's get rid of that one.

Also, I think it is called *complement* value and you simply do

        /* Use the complement value. */
        rec->fmp.checksum = -checksum;

I'd say.

---

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 9eaf892e35b9..f8799beddcc4 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -195,11 +195,12 @@ static u32 do_fmp_checksum(struct cper_sec_fru_mem_poison *fmp, u32 len)
 	return checksum;
 }
 
-/* Calculate a new checksum. */
-static u32 get_fmp_checksum(struct fru_rec *rec)
+static int update_record_on_storage(struct fru_rec *rec)
 {
 	u32 len, checksum;
+	int ret;
 
+	/* Calculate a new checksum. */
 	len = get_fmp_len(rec);
 
 	/* Get the current total. */
@@ -208,15 +209,8 @@ static u32 get_fmp_checksum(struct fru_rec *rec)
 	/* Subtract the current checksum from total. */
 	checksum -= rec->fmp.checksum;
 
-	/* Return the compliment value. */
-	return 0 - checksum;
-}
-
-static int update_record_on_storage(struct fru_rec *rec)
-{
-	int ret;
-
-	rec->fmp.checksum = get_fmp_checksum(rec);
+	/* Use the complement value. */
+	rec->fmp.checksum = -checksum;
 
 	pr_debug("Writing to storage");
  
Borislav Petkov Feb. 14, 2024, 11:10 a.m. UTC | #7
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
> +{
> +	memset(fpd, 0, sizeof(struct cper_fru_poison_desc));
> +
> +	fpd->timestamp	= m->time;
> +	fpd->hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
> +	fpd->hw_id	= m->ipid;
> +	fpd->addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
> +	fpd->addr	= m->addr;
> +}

Get rid of that one:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index f8799beddcc4..090b60d269e7 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -221,17 +221,6 @@ static int update_record_on_storage(struct fru_rec *rec)
 	return ret;
 }
 
-static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
-{
-	memset(fpd, 0, sizeof(struct cper_fru_poison_desc));
-
-	fpd->timestamp	= m->time;
-	fpd->hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
-	fpd->hw_id	= m->ipid;
-	fpd->addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
-	fpd->addr	= m->addr;
-}
-
 static bool has_valid_entries(struct fru_rec *rec)
 {
 	if (!(rec->fmp.validation_bits & FMP_VALID_LIST_ENTRIES))
@@ -288,7 +277,13 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 
 	mutex_lock(&fmpm_update_mutex);
 
-	init_fpd(&fpd, m);
+	memset(&fpd, 0, sizeof(struct cper_fru_poison_desc));
+
+	fpd.timestamp	= m->time;
+	fpd.hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
+	fpd.hw_id	= m->ipid;
+	fpd.addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
+	fpd.addr	= m->addr;
 
 	/* This is the first entry, so just save it. */
 	if (!has_valid_entries(rec))
  
Borislav Petkov Feb. 14, 2024, 11:15 a.m. UTC | #8
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static bool has_valid_entries(u64 valid_bits)
> +{
> +	if (!(valid_bits & FMP_VALID_LIST_ENTRIES))
> +		return false;
> +
> +	if (!(valid_bits & FMP_VALID_LIST))
> +		return false;
> +
> +	return true;
> +}

Rename:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 090b60d269e7..3da3f40f1efe 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -221,7 +221,7 @@ static int update_record_on_storage(struct fru_rec *rec)
 	return ret;
 }
 
-static bool has_valid_entries(struct fru_rec *rec)
+static bool rec_has_valid_entries(struct fru_rec *rec)
 {
 	if (!(rec->fmp.validation_bits & FMP_VALID_LIST_ENTRIES))
 		return false;
@@ -286,7 +286,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 	fpd.addr	= m->addr;
 
 	/* This is the first entry, so just save it. */
-	if (!has_valid_entries(rec))
+	if (!rec_has_valid_entries(rec))
 		goto save_fpd;
 
 	/* Ignore already recorded errors. */
@@ -397,7 +397,7 @@ static void retire_mem_records(void)
 	for_each_fru(i, rec) {
 		fmp = &rec->fmp;
 
-		if (!has_valid_entries(rec))
+		if (!rec_has_valid_entries(rec))
 			continue;
 
 		cpu = get_cpu_from_fru_id(fmp->fru_id);

---

and this one:

---

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 3da3f40f1efe..813cc6a4f435 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -255,12 +255,12 @@ static bool same_fpd(struct cper_fru_poison_desc *old, struct cper_fru_poison_de
 	return true;
 }
 
-static bool is_dup_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *new)
+static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
 {
 	unsigned int i;
 
 	for (i = 0; i < rec->fmp.nr_entries; i++) {
-		if (same_fpd(get_fpd(rec, i), new)) {
+		if (same_fpd(get_fpd(rec, i), fpd)) {
 			pr_debug("Found duplicate record");
 			return true;
 		}
@@ -290,7 +290,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 		goto save_fpd;
 
 	/* Ignore already recorded errors. */
-	if (is_dup_fpd(rec, &fpd))
+	if (rec_has_fpd(rec, &fpd))
 		goto out_unlock;
 
 	if (rec->fmp.nr_entries >= max_nr_entries) {
  
Borislav Petkov Feb. 14, 2024, 11:36 a.m. UTC | #9
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static void update_fru_record(struct fru_rec *rec, struct mce *m)
> +{
> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
> +	struct cper_fru_poison_desc fpd;
> +	u32 entry = 0;
> +
> +	mutex_lock(&fmpm_update_mutex);
> +
> +	init_fpd(&fpd, m);
> +
> +	/* This is the first entry, so just save it. */
> +	if (!has_valid_entries(fmp->validation_bits))
> +		goto save_fpd;

Not needed - if it is the first entry, it'll get saved there.

> +	/* Ignore already recorded errors. */
> +	if (is_dup_fpd(rec, &fpd))
> +		goto out_unlock;
> +
> +	if (fmp->nr_entries >= max_nr_entries) {
> +		pr_warn("Exceeded number of entries for FRU 0x%016llx", fmp->fru_id);
> +		goto out_unlock;
> +	}
> +
> +	entry = fmp->nr_entries;

..

> +static void retire_dram_row(u64 addr, u64 id, u32 cpu)
> +{
> +	struct atl_err a_err;

Yap, exactly, this should use atl_err and not struct mce.

> +
> +	memset(&a_err, 0, sizeof(struct atl_err));
> +
> +	a_err.addr = addr;
> +	a_err.ipid = id;
> +	a_err.cpu  = cpu;
> +
> +	amd_retire_dram_row(&a_err);
> +}
> +
> +static int fru_mem_poison_handler(struct notifier_block *nb, unsigned long val, void *data)
> +{
> +	struct mce *m = (struct mce *)data;
> +	struct fru_rec *rec;
> +
> +	if (!mce_is_memory_error(m))
> +		return NOTIFY_DONE;
> +
> +	retire_dram_row(m->addr, m->ipid, m->extcpu);
> +
> +	/*
> +	 * This should not happen on real errors. But it could happen from

What exactly is "This" here?

> +	 * software error injection, etc.
> +	 */
> +	rec = get_fru_record(m->ppin);
> +	if (!rec)
> +		return NOTIFY_DONE;
> +
> +	update_fru_record(rec, m);
> +
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block fru_mem_poison_nb = {
> +	.notifier_call  = fru_mem_poison_handler,
> +	.priority	= MCE_PRIO_LOWEST,
> +};
> +
> +static u32 get_cpu_from_fru_id(u64 fru_id)

Fold into the single callsite.

> +{
> +	unsigned int cpu = 0;
> +
> +	/* Should there be more robust error handling if none found? */
> +	for_each_online_cpu(cpu) {
> +		if (topology_ppin(cpu) == fru_id)
> +			break;
> +	}
> +
> +	return cpu;
> +}
> +
> +static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
> +{
> +	struct cper_fru_poison_desc *fpd;
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_entries; i++) {
> +		fpd = get_fpd(rec, i);
> +
> +		if (fpd->hw_id_type != FPD_HW_ID_TYPE_MCA_IPID)
> +			continue;
> +
> +		if (fpd->addr_type != FPD_ADDR_TYPE_MCA_ADDR)
> +			continue;
> +
> +		retire_dram_row(fpd->addr, fpd->hw_id, cpu);
> +	}
> +}
> +
> +static void retire_mem_records(void)
> +{
> +	struct cper_sec_fru_mem_poison *fmp;
> +	struct fru_rec *rec;
> +	unsigned int i;
> +	u32 cpu;
> +
> +	for_each_fru(i, rec) {
> +		fmp = get_fmp(rec);
> +
> +		if (!has_valid_entries(fmp->validation_bits))
> +			continue;
> +
> +		cpu = get_cpu_from_fru_id(fmp->fru_id);

Pass in that fmp thing into retire_dram_row() so that you can delay
that get_cpu_from_fru_id() call until the moment you actually need it.

> +static int save_new_records(void)
> +{
> +	struct fru_rec *rec;
> +	unsigned int i;
> +	int ret = 0;
> +
> +	for_each_fru(i, rec) {
> +		/* Skip restored records. Should these be fixed up? */

I don't understand that question.

> +		if (rec->hdr.record_length)
> +			continue;
> +
> +		set_rec_fields(rec);
> +
> +		ret = update_record_on_storage(rec);
> +		if (ret)
> +			break;
> +	}
> +
> +	return ret;
> +}
> +
> +static bool is_valid_fmp(struct fru_rec *rec)

fmp_is_valid()

> +{
> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
> +	u32 len = get_fmp_len(rec);
> +
> +	if (!fmp)
> +		return false;
> +
> +	if (!len)
> +		return false;
> +
> +	/* Checksum must sum to zero for the entire section. */
> +	if (do_fmp_checksum(fmp, len))
> +		return false;
> +
> +	if (!(fmp->validation_bits & FMP_VALID_ARCH_TYPE))
> +		return false;
> +
> +	if (fmp->fru_arch_type != FMP_ARCH_TYPE_X86_CPUID_1_EAX)
> +		return false;
> +
> +	if (!(fmp->validation_bits & FMP_VALID_ARCH))
> +		return false;
> +
> +	if (fmp->fru_arch != cpuid_eax(1))
> +		return false;
> +
> +	if (!(fmp->validation_bits & FMP_VALID_ID_TYPE))
> +		return false;
> +
> +	if (fmp->fru_id_type != FMP_ID_TYPE_X86_PPIN)
> +		return false;
> +
> +	if (!(fmp->validation_bits & FMP_VALID_ID))
> +		return false;
> +
> +	return true;
> +}
> +
> +static void restore_record(struct fru_rec *new, struct fru_rec *old)
> +{
> +	/* Records larger than max_rec_len were skipped earlier. */
> +	size_t len = min(max_rec_len, old->hdr.record_length);
> +
> +	memcpy(new, old, len);
> +}

Fold into the single call site.

> +
> +static bool valid_record(struct fru_rec *old)
> +{
> +	struct fru_rec *new;
> +
> +	if (!is_valid_fmp(old)) {
> +		pr_debug("Ignoring invalid record");
> +		return false;
> +	}
> +
> +	new = get_fru_record(old->fmp.fru_id);
> +	if (!new) {
> +		pr_debug("Ignoring record for absent FRU");
> +		return false;
> +	}
> +
> +	/* What if ERST has duplicate FRU entries? */
> +	restore_record(new, old);
> +
> +	return true;
> +}
> +
> +/*
> + * Fetch saved records from persistent storage.
> + *
> + * For each found record:
> + * - If it was not created by this module, then ignore it.
> + * - If it is valid, then copy its data to the local cache.
> + * - If it is not valid, then erase it.
> + */
> +static int get_saved_records(void)
> +{
> +	struct fru_rec *old;
> +	u64 record_id;
> +	int ret, pos;
> +	ssize_t len;
> +
> +	/*
> +	 * Assume saved records match current max size.
> +	 *
> +	 * However, this may not be true depending on module parameters.

This must work with module parameters, though. Or, as said and
preferrably, there should not be any module parameters at all.

> +	 */
> +	old = kmalloc(max_rec_len, GFP_KERNEL);
> +	if (!old) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = erst_get_record_id_begin(&pos);
> +	if (ret < 0)
> +		goto out_end;
> +
> +	while (!erst_get_record_id_next(&pos, &record_id)) {
> +		/*
> +		 * Make sure to clear temporary buffer between reads to avoid
> +		 * leftover data from records of various sizes.
> +		 */
> +		memset(old, 0, max_rec_len);
> +
> +		len = erst_read_record(record_id, &old->hdr, max_rec_len,
> +				       sizeof(struct fru_rec), &CPER_CREATOR_FMP);
> +
> +		/* Should this be retried if the temporary buffer is too small? */

Only when it turns out that it is necessary.

> +		if (len < 0)
> +			continue;
> +
> +		if (!valid_record(old))
> +			erst_clear(record_id);

Where is the check which ignores the record not created by this module?

Because this clears all records it deems not valid and that thing needs
to be really careful here and be sure what exactly it clears...

Thx.
  
Borislav Petkov Feb. 14, 2024, 12:02 p.m. UTC | #10
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static unsigned int get_cpu_for_fru_num(unsigned int i)
> +{
> +	unsigned int cpu = 0;
> +
> +	/* Should there be more robust error handling if none found? */
> +	for_each_online_cpu(cpu) {

You need to block CPU hotplug operations before iterating over online
CPUs: cpus_read_lock/unlock.

> +		if (topology_physical_package_id(cpu) == i)
> +			return cpu;
> +	}
> +
> +	return cpu;
> +}

Fold this one into its single callsite.

> +
> +static void init_fmps(void)
> +{
> +	struct fru_rec *rec;
> +	unsigned int i, cpu;
> +
> +	for_each_fru(i, rec) {
> +		cpu = get_cpu_for_fru_num(i);
> +		set_fmp_fields(get_fmp(rec), cpu);
> +	}
> +}
> +
> +static int get_system_info(void)
> +{
> +	u8 model = boot_cpu_data.x86_model;

No need for that local var - just use boot_cpu_data.x86_model.

> +	/* Only load on MI300A systems for now. */
> +	if (!(model >= 0x90 && model <= 0x9f))
> +		return -ENODEV;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_AMD_PPIN)) {
> +		pr_debug("PPIN feature not available");
> +		return -ENODEV;
> +	}
> +
> +	/* Use CPU Package (Socket) as FRU for MI300 systems. */
> +	max_nr_fru = topology_max_packages();
> +	if (!max_nr_fru)
> +		return -ENODEV;
> +
> +	if (!max_nr_entries)
> +		max_nr_entries = FMPM_DEFAULT_MAX_NR_ENTRIES;
> +
> +	max_rec_len  = sizeof(struct fru_rec);
> +	max_rec_len += sizeof(struct cper_fru_poison_desc) * max_nr_entries;
> +
> +	pr_debug("max_nr_fru=%u max_nr_entries=%u, max_rec_len=%lu",
> +		 max_nr_fru, max_nr_entries, max_rec_len);
> +	return 0;
> +}
> +
> +static void deallocate_records(void)

free_records

> +{
> +	struct fru_rec *rec;
> +	int i;
> +
> +	for_each_fru(i, rec)
> +		kfree(rec);
> +
> +	kfree(fru_records);
> +}
> +
> +static int allocate_records(void)
> +{
> +	int i, ret = 0;
> +
> +	fru_records = kcalloc(max_nr_fru, sizeof(struct fru_rec *), GFP_KERNEL);
> +	if (!fru_records) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < max_nr_fru; i++) {
> +		fru_records[i] = kzalloc(max_rec_len, GFP_KERNEL);
> +		if (!fru_records[i]) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +	}
> +
> +	return ret;
> +
> +out_free:
> +	for (; i >= 0; i--)
> +		kfree(fru_records[i]);
> +
> +	kfree(fru_records);
> +out:
> +	return ret;
> +}
> +
> +static const struct x86_cpu_id fmpm_cpuids[] = {
> +	X86_MATCH_VENDOR_FAM(AMD, 0x19, NULL),

This is why this should depend on AMD in Kconfig.
  
Yazen Ghannam Feb. 14, 2024, 2:21 p.m. UTC | #11
On 2/14/2024 4:06 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> Memory errors are an expected occurrence on systems with high memory
>> density. Generally, errors within a small number of unique physical
>> locations is acceptable, based on manufacturer and/or admin policy.
>> During run time, memory with errors may be retired so it is no longer
>> used by the system. This is done in the kernel memory manager, and the
>> effect will remain until the system is restarted.
>>
>> If a memory location is consistently faulty, then the same run time
>> error handling may occur in the next reboot cycle. Running jobs may be
>> terminated due to previously known bad memory. This could be prevented
>> if information from the previous boot was not lost.
>>
>> Some add-in cards with driver-managed memory have on-board persistent
>> storage. Their driver may save memory error information to the
>> persistent storage during run time. The information may then be restored
>> after reset, and known bad memory may be retired before use. A running
>> log of bad memory locations is kept across multiple resets.
> 
> Too many "may"s above, please tone them down.
>

Will try :)
  
>> A similar solution is desirable for CPUs. However, this solution should
> 
> GPUs you mean?
>

I mean CPUs. GPUs would fall under the "add-in" card scenario.
  
>> leverage industry-standard components, as much as possible, rather than
>> a bespoke platform driver.
>>
>> Two components are needed: a record format and a persistent storage
>> interface.
>>
>> A UEFI CPER "FRU Memory Poison Section" is being proposed, along with a
>> "Memory Poison Descriptor", to use for this purpose. These new structures
>> are minimal, saving space on limited non-volatile memory, and extensible.
>>
>> CPER-aware persistent storage interfaces, like ACPI ERST and EFI Runtime
>> Variables, can be used. A new interface is not required.
> 
> I don't think stuff which is being proposed belongs here.
>

Do you mean this should be left out of the commit message?
  
>> Implement a new module to manage the record formats on persistent
>> storage. Use the requirements for an AMD MI300-based system to start.
>> Vendor- and platform-specific details can be abstracted later as needed.
> 
> This is a big diff so I'm splitting mails.
> 

Okay.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:28 p.m. UTC | #12
On 2/14/2024 4:28 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +config RAS_FMPM
>> +	tristate "FRU Memory Poison Manager"
>> +	default m
>> +	depends on X86_MCE
> 
> I know this is generic-ish but it needs to be enabled only on AMD for
> now. Whoever wants it somewhere else, then whoever needs to test it
> there first and then enable it there.
>

Ack.
  
>> +	imply AMD_ATL
>> +	help
>> +	  Support saving and restoring memory error information across reboot
>> +	  cycles using ACPI ERST as persistent storage. Error information is
> 
> s/cycles//
>

Ack.
  
>> +	  saved with the UEFI CPER "FRU Memory Poison" section format.
>> +
>> +	  Memory may be retired during boot time and run time depending on
> 
> s/may/is/
> 
> Please check all your text - too many "may"s for something which is not
> a vendor doc. :)
>

Ack.
  
>> +	  platform-specific policies.
>> +
>>   endif
>> diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
>> index 3fac80f58005..11f95d59d397 100644
>> --- a/drivers/ras/Makefile
>> +++ b/drivers/ras/Makefile
>> @@ -3,4 +3,5 @@ obj-$(CONFIG_RAS)	+= ras.o
>>   obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
>>   obj-$(CONFIG_RAS_CEC)	+= cec.o
>>   
>> +obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
>>   obj-y			+= amd/atl/
>> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
>> new file mode 100644
>> index 000000000000..077d9f35cc7d
>> --- /dev/null
>> +++ b/drivers/ras/amd/fmpm.c
>> @@ -0,0 +1,776 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * FRU (Field-Replaceable Unit) Memory Poison Manager
>> + *
>> + * Copyright (c) 2024, Advanced Micro Devices, Inc.
>> + * All Rights Reserved.
>> + *
>> + * Authors:
>> + *	Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
>> + *	Muralidhara M K <muralidhara.mk@amd.com>
>> + *	Yazen Ghannam <Yazen.Ghannam@amd.com>
>> + *
>> + * Implementation notes, assumptions, and limitations:
>> + *
>> + * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
>> + *   included in the UEFI specification. So they are defined here. Afterwards, they
>> + *   may be moved to linux/cper.h, if appropriate.
>> + *
>> + * - Platforms based on AMD MI300 systems will be the first to use these structures.
>> + *   There are a number of assumptions made here that will need to be generalized
>> + *   to support other platforms.
>> + *
>> + *   AMD MI300-based platform(s) assumptions:
>> + *   - Memory errors are reported through x86 MCA.
>> + *   - The entire DRAM row containing a memory error should be retired.
>> + *   - There will be (1) FRU Memory Poison Section per CPER.
>> + *   - The FRU will be the CPU Package (Processor Socket).
>> + *   - The default number of Memory Poison Descriptor entries should be (8).
>> + *   - The Platform will use ACPI ERST for persistent storage.
>> + *   - All FRU records should be saved to persistent storage. Module init will
>> + *     fail if any FRU record is not successfully written.
> 
> Please drop all that capitalized spelling.
>

For which parts? The acronyms, structure names, or general things like package/socket?
Or all the above?
  
>> + * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
>> + *   to use this module for other vendors.
> 
> This is not needed.
>

Ack.
  
>> + * - Boot time memory retirement may occur later than ideal due to dependencies
>> + *   on other libraries and drivers. This leaves a gap where bad memory may be
>> + *   accessed during early boot stages.
>> + *
>> + * - Enough memory should be pre-allocated for each FRU record to be able to hold
>> + *   the expected number of descriptor entries. This, mostly empty, record is
>> + *   written to storage during init time. Subsequent writes to the same record
>> + *   should allow the Platform to update the stored record in-place. Otherwise,
>> + *   if the record is extended, then the Platform may need to perform costly memory
>> + *   management operations on the storage. For example, the Platform may spend time
>> + *   in Firmware copying and invalidating memory on a relatively slow SPI ROM.
> 
> That's a good thing to have here.
>

Okay.
  
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/cper.h>
>> +#include <linux/ras.h>
>> +
>> +#include <acpi/apei.h>
>> +
>> +#include <asm/cpu_device_id.h>
>> +#include <asm/mce.h>
>> +
>> +#pragma pack(1)
> 
> Is that some ugly thing to avoid adding __packed annotation to the
> structure definitions below?
> 

Yes, but __packed could be used too.

> "GCC supports several types of pragmas, primarily in order to compile
> code originally written for other compilers. Note that in general we do
> not recommend the use of pragmas; See Declaring Attributes of Functions,
> for further explanation. "
> 
> Oh, that 1 is something else:
> 
> -fpack-struct[=n]
> 
>      Without a value specified, pack all structure members together
>      without holes. When a value is specified (which must be a small
>      power of two), pack structure members according to this value,
>      representing the maximum alignment (that is, objects with default
>      alignment requirements larger than this are output potentially
>      unaligned at the next fitting location.
> 
> So do I understand it correctly that struct members should be aligned to
> 2^1 bytes?
>

Yes, no padding and no reordering too, I think.
  
> Grepping the tree, this looks like something BIOS does...
>

The BIOS does do this on its side. We need to make sure to do it on the kernel
side.

See <linux/cper.h> and <acpi/actbl1.h> for examples.
  
Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:45 p.m. UTC | #13
On 2/14/2024 5:29 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +/* FRU Memory Poison Section, UEFI vX.Y sec N.X.Z */
> 
> Whack those:
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 328e0a962c23..0246b13b5ba1 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -72,7 +72,7 @@
>   /* FRU ID Types */
>   #define FMP_ID_TYPE_X86_PPIN		0
>   
> -/* FRU Memory Poison Section, UEFI vX.Y sec N.X.Z */
> +/* FRU Memory Poison Section */
>   struct cper_sec_fru_mem_poison {
>   	u32 checksum;
>   	u64 validation_bits;
> @@ -89,7 +89,7 @@ struct cper_sec_fru_mem_poison {
>   /* FRU Descriptor Address Types */
>   #define FPD_ADDR_TYPE_MCA_ADDR		0
>   
> -/* Memory Poison Descriptor, UEFI vX.Y sec N.X.Y */
> +/* Memory Poison Descriptor */
>   struct cper_fru_poison_desc {
>   	u64 timestamp;
>   	u32 hw_id_type;
> 
>

Ack.
  
>> +/**
>> + * DOC: fru_poison_entries (byte)
>> + * Maximum number of descriptor entries possible for each FRU.
>> + *
>> + * Values between '1' and '255' are valid.
>> + * No input or '0' will default to FMPM_DEFAULT_MAX_NR_ENTRIES.
>> + */
>> +static u8 max_nr_entries;
>> +module_param(max_nr_entries, byte, 0644);
>> +MODULE_PARM_DESC(max_nr_entries,
>> +		 "Maximum number of memory poison descriptor entries per FRU");
> 
> Why is there a module parameter?
>

I didn't think too much on this one. I kept the idea from the old set.

Murali, Naveen, any comments?
  
> So that people can brick their BIOSes if it can't handle some size?
>

The ERST operations should fail and return an error status if there's not enough
space.
  
> Can we read out the max size of the area destined for FRU records from
> somewhere and go with it?
>

This idea was done in the old set. But it's not correct, IMO.

The 'size' we can see from ERST isn't necessarily the available storage size.
For the !NVRAM cases, it's the size of the temporary bounce buffer that BIOS
can use to pass records to and from the OS.

So it would be big enough to old the largest record BIOS expects. It doesn't
have to match the record size used in this module. ERST is used by other code
with different records.
  
>> +#define FMPM_DEFAULT_MAX_NR_ENTRIES	8
>> +
>> +/* Maximum number of FRUs in the system. */
>> +static unsigned int max_nr_fru;
>> +
>> +/* Total length of record including headers and list of descriptor entries. */
>> +static size_t max_rec_len;
>> +
>> +/*
>> + * Protect the local cache and prevent concurrent writes to storage.
> 
> "local cache"?
>

Yes, we keep a local copy of the records within the module. That way we just need
to update the local copy and write it down to the platform. This saves time and
avoids interrupting the platform to do an extra read.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:46 p.m. UTC | #14
On 2/14/2024 5:34 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +#include <asm/mce.h>
> 
> ...
> 
>> +static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
> 								^^^^^^
> 
> 
> This is a generic thing and thus can't use an x86-ism struct mce,
> remember?
> 

Yep, that's one of the assumptions/limitations I highlighted.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:49 p.m. UTC | #15
On 2/14/2024 5:51 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static inline struct cper_sec_fru_mem_poison *get_fmp(struct fru_rec *rec)
>> +{
>> +	return &rec->fmp;
>> +}
> 
> Let's get rid of those silly helpers, diff for this one below.
> 
> The logic is, you pass around struct fru_rec *rec and inside the
> functions you deref what you need.
>

Okay, agreed.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:56 p.m. UTC | #16
On 2/14/2024 6:05 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static inline struct cper_fru_poison_desc *get_fpd(struct fru_rec *rec, u32 entry)
>> +{
>> +	return &rec->entries[entry];
>> +}
> 
> This one needs to go too.
>

Ack.
  
>> +static inline u32 get_fmp_len(struct fru_rec *rec)
>> +{
>> +	return rec->sec_desc.section_length - sizeof(struct cper_section_descriptor);
>> +}
> 
> Oh well, I guess we can keep that one.
>

Okay.
  
>> +/* Calculate a new checksum. */
>> +static u32 get_fmp_checksum(struct fru_rec *rec)
>> +{
>> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
>> +	u32 len, checksum;
>> +
>> +	len = get_fmp_len(rec);
>> +
>> +	/* Get the current total. */
>> +	checksum = do_fmp_checksum(fmp, len);
>> +
>> +	/* Subtract the current checksum from total. */
>> +	checksum -= fmp->checksum;
>> +
>> +	/* Return the compliment value. */
>> +	return 0 - checksum;
>> +}
> 
> Let's get rid of that one.
> 
> Also, I think it is called *complement* value and you simply do
>

Yep, good catch! (compliment)
  
>          /* Use the complement value. */
>          rec->fmp.checksum = -checksum;
> 
> I'd say.
>

This was my first thought. Other checksum code in the kernel does
the (0-X) thing. So I wasn't sure if there's any odd side effects
of one over the other. And I didn't take the time to dig into it.
  
> ---
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 9eaf892e35b9..f8799beddcc4 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -195,11 +195,12 @@ static u32 do_fmp_checksum(struct cper_sec_fru_mem_poison *fmp, u32 len)
>   	return checksum;
>   }
>   
> -/* Calculate a new checksum. */
> -static u32 get_fmp_checksum(struct fru_rec *rec)

I made this a helper because we need to validate the checksum when
reading records from storage too.

> +static int update_record_on_storage(struct fru_rec *rec)
>   {
>   	u32 len, checksum;
> +	int ret;
>   
> +	/* Calculate a new checksum. */
>   	len = get_fmp_len(rec);
>   
>   	/* Get the current total. */
> @@ -208,15 +209,8 @@ static u32 get_fmp_checksum(struct fru_rec *rec)
>   	/* Subtract the current checksum from total. */
>   	checksum -= rec->fmp.checksum;
>   
> -	/* Return the compliment value. */
> -	return 0 - checksum;
> -}
> -
> -static int update_record_on_storage(struct fru_rec *rec)
> -{
> -	int ret;
> -
> -	rec->fmp.checksum = get_fmp_checksum(rec);
> +	/* Use the complement value. */
> +	rec->fmp.checksum = -checksum;
>   
>   	pr_debug("Writing to storage");
>   
> 

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:57 p.m. UTC | #17
On 2/14/2024 6:10 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
>> +{
>> +	memset(fpd, 0, sizeof(struct cper_fru_poison_desc));
>> +
>> +	fpd->timestamp	= m->time;
>> +	fpd->hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
>> +	fpd->hw_id	= m->ipid;
>> +	fpd->addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
>> +	fpd->addr	= m->addr;
>> +}
> 
> Get rid of that one:
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index f8799beddcc4..090b60d269e7 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -221,17 +221,6 @@ static int update_record_on_storage(struct fru_rec *rec)
>   	return ret;
>   }
>   
> -static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
> -{
> -	memset(fpd, 0, sizeof(struct cper_fru_poison_desc));
> -
> -	fpd->timestamp	= m->time;
> -	fpd->hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
> -	fpd->hw_id	= m->ipid;
> -	fpd->addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
> -	fpd->addr	= m->addr;
> -}
> -
>   static bool has_valid_entries(struct fru_rec *rec)
>   {
>   	if (!(rec->fmp.validation_bits & FMP_VALID_LIST_ENTRIES))
> @@ -288,7 +277,13 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
>   
>   	mutex_lock(&fmpm_update_mutex);
>   
> -	init_fpd(&fpd, m);
> +	memset(&fpd, 0, sizeof(struct cper_fru_poison_desc));
> +
> +	fpd.timestamp	= m->time;
> +	fpd.hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
> +	fpd.hw_id	= m->ipid;
> +	fpd.addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
> +	fpd.addr	= m->addr;
>   
>   	/* This is the first entry, so just save it. */
>   	if (!has_valid_entries(rec))
> 

Okay.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 2:59 p.m. UTC | #18
On 2/14/2024 6:15 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static bool has_valid_entries(u64 valid_bits)
>> +{
>> +	if (!(valid_bits & FMP_VALID_LIST_ENTRIES))
>> +		return false;
>> +
>> +	if (!(valid_bits & FMP_VALID_LIST))
>> +		return false;
>> +
>> +	return true;
>> +}
> 
> Rename:
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 090b60d269e7..3da3f40f1efe 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -221,7 +221,7 @@ static int update_record_on_storage(struct fru_rec *rec)
>   	return ret;
>   }
>   
> -static bool has_valid_entries(struct fru_rec *rec)
> +static bool rec_has_valid_entries(struct fru_rec *rec)
>   {
>   	if (!(rec->fmp.validation_bits & FMP_VALID_LIST_ENTRIES))
>   		return false;
> @@ -286,7 +286,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
>   	fpd.addr	= m->addr;
>   
>   	/* This is the first entry, so just save it. */
> -	if (!has_valid_entries(rec))
> +	if (!rec_has_valid_entries(rec))
>   		goto save_fpd;
>   
>   	/* Ignore already recorded errors. */
> @@ -397,7 +397,7 @@ static void retire_mem_records(void)
>   	for_each_fru(i, rec) {
>   		fmp = &rec->fmp;
>   
> -		if (!has_valid_entries(rec))
> +		if (!rec_has_valid_entries(rec))
>   			continue;
>   
>   		cpu = get_cpu_from_fru_id(fmp->fru_id);
> 
> ---
> 
> and this one:
> 
> ---
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 3da3f40f1efe..813cc6a4f435 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -255,12 +255,12 @@ static bool same_fpd(struct cper_fru_poison_desc *old, struct cper_fru_poison_de
>   	return true;
>   }
>   
> -static bool is_dup_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *new)
> +static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
>   {
>   	unsigned int i;
>   
>   	for (i = 0; i < rec->fmp.nr_entries; i++) {
> -		if (same_fpd(get_fpd(rec, i), new)) {
> +		if (same_fpd(get_fpd(rec, i), fpd)) {
>   			pr_debug("Found duplicate record");
>   			return true;
>   		}
> @@ -290,7 +290,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
>   		goto save_fpd;
>   
>   	/* Ignore already recorded errors. */
> -	if (is_dup_fpd(rec, &fpd))
> +	if (rec_has_fpd(rec, &fpd))
>   		goto out_unlock;
>   
>   	if (rec->fmp.nr_entries >= max_nr_entries) {
> 

Okay.

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 3:26 p.m. UTC | #19
On 2/14/2024 6:36 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static void update_fru_record(struct fru_rec *rec, struct mce *m)
>> +{
>> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
>> +	struct cper_fru_poison_desc fpd;
>> +	u32 entry = 0;
>> +
>> +	mutex_lock(&fmpm_update_mutex);
>> +
>> +	init_fpd(&fpd, m);
>> +
>> +	/* This is the first entry, so just save it. */
>> +	if (!has_valid_entries(fmp->validation_bits))
>> +		goto save_fpd;
> 
> Not needed - if it is the first entry, it'll get saved there.
>

Get saved where?

For brand new records, the module will allocate them with the headers
and no descriptor entries (empty list).
  
>> +	/* Ignore already recorded errors. */
>> +	if (is_dup_fpd(rec, &fpd))
>> +		goto out_unlock;
>> +
>> +	if (fmp->nr_entries >= max_nr_entries) {
>> +		pr_warn("Exceeded number of entries for FRU 0x%016llx", fmp->fru_id);
>> +		goto out_unlock;
>> +	}
>> +
>> +	entry = fmp->nr_entries;
> 
> ...
> 
>> +static void retire_dram_row(u64 addr, u64 id, u32 cpu)
>> +{
>> +	struct atl_err a_err;
> 
> Yap, exactly, this should use atl_err and not struct mce.
>

Yes, tried to do *some* things generic.
  
>> +
>> +	memset(&a_err, 0, sizeof(struct atl_err));
>> +
>> +	a_err.addr = addr;
>> +	a_err.ipid = id;
>> +	a_err.cpu  = cpu;
>> +
>> +	amd_retire_dram_row(&a_err);
>> +}
>> +
>> +static int fru_mem_poison_handler(struct notifier_block *nb, unsigned long val, void *data)
>> +{
>> +	struct mce *m = (struct mce *)data;
>> +	struct fru_rec *rec;
>> +
>> +	if (!mce_is_memory_error(m))
>> +		return NOTIFY_DONE;
>> +
>> +	retire_dram_row(m->addr, m->ipid, m->extcpu);
>> +
>> +	/*
>> +	 * This should not happen on real errors. But it could happen from
> 
> What exactly is "This" here?
>

Ah right. The module should have created, or restored, a record for each FRU
in the system during module init. So the runtime handler should always find
a valid record for a FRU. The only exception I could think of, besides bugs,
is if the user does software error injection and a valid FRU ID doesn't get
set.
  
>> +	 * software error injection, etc.
>> +	 */
>> +	rec = get_fru_record(m->ppin);
>> +	if (!rec)
>> +		return NOTIFY_DONE;
>> +
>> +	update_fru_record(rec, m);
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block fru_mem_poison_nb = {
>> +	.notifier_call  = fru_mem_poison_handler,
>> +	.priority	= MCE_PRIO_LOWEST,
>> +};
>> +
>> +static u32 get_cpu_from_fru_id(u64 fru_id)
> 
> Fold into the single callsite.
>

Ack.
  
>> +{
>> +	unsigned int cpu = 0;
>> +
>> +	/* Should there be more robust error handling if none found? */
>> +	for_each_online_cpu(cpu) {
>> +		if (topology_ppin(cpu) == fru_id)
>> +			break;
>> +	}
>> +
>> +	return cpu;
>> +}
>> +
>> +static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
>> +{
>> +	struct cper_fru_poison_desc *fpd;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr_entries; i++) {
>> +		fpd = get_fpd(rec, i);
>> +
>> +		if (fpd->hw_id_type != FPD_HW_ID_TYPE_MCA_IPID)
>> +			continue;
>> +
>> +		if (fpd->addr_type != FPD_ADDR_TYPE_MCA_ADDR)
>> +			continue;
>> +
>> +		retire_dram_row(fpd->addr, fpd->hw_id, cpu);
>> +	}
>> +}
>> +
>> +static void retire_mem_records(void)
>> +{
>> +	struct cper_sec_fru_mem_poison *fmp;
>> +	struct fru_rec *rec;
>> +	unsigned int i;
>> +	u32 cpu;
>> +
>> +	for_each_fru(i, rec) {
>> +		fmp = get_fmp(rec);
>> +
>> +		if (!has_valid_entries(fmp->validation_bits))
>> +			continue;
>> +
>> +		cpu = get_cpu_from_fru_id(fmp->fru_id);
> 
> Pass in that fmp thing into retire_dram_row() so that you can delay
> that get_cpu_from_fru_id() call until the moment you actually need it.
>

Okay.
  
>> +static int save_new_records(void)
>> +{
>> +	struct fru_rec *rec;
>> +	unsigned int i;
>> +	int ret = 0;
>> +
>> +	for_each_fru(i, rec) {
>> +		/* Skip restored records. Should these be fixed up? */
> 
> I don't understand that question.
>

I think there's a case where the record on storage can space for fewer
than the target number of entries.

For example, module wants to have space for 8 entries per record. A record
from storage has valid entries (which should be restored and memory retired),
but it only has space for 4 entries.

So should the module fix up the record from storage by adjusting its record
length and then writing it back down?
  
>> +		if (rec->hdr.record_length)
>> +			continue;
>> +
>> +		set_rec_fields(rec);
>> +
>> +		ret = update_record_on_storage(rec);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static bool is_valid_fmp(struct fru_rec *rec)
> 
> fmp_is_valid()
>

Ack.
  
>> +{
>> +	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
>> +	u32 len = get_fmp_len(rec);
>> +
>> +	if (!fmp)
>> +		return false;
>> +
>> +	if (!len)
>> +		return false;
>> +
>> +	/* Checksum must sum to zero for the entire section. */
>> +	if (do_fmp_checksum(fmp, len))
>> +		return false;
>> +
>> +	if (!(fmp->validation_bits & FMP_VALID_ARCH_TYPE))
>> +		return false;
>> +
>> +	if (fmp->fru_arch_type != FMP_ARCH_TYPE_X86_CPUID_1_EAX)
>> +		return false;
>> +
>> +	if (!(fmp->validation_bits & FMP_VALID_ARCH))
>> +		return false;
>> +
>> +	if (fmp->fru_arch != cpuid_eax(1))
>> +		return false;
>> +
>> +	if (!(fmp->validation_bits & FMP_VALID_ID_TYPE))
>> +		return false;
>> +
>> +	if (fmp->fru_id_type != FMP_ID_TYPE_X86_PPIN)
>> +		return false;
>> +
>> +	if (!(fmp->validation_bits & FMP_VALID_ID))
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static void restore_record(struct fru_rec *new, struct fru_rec *old)
>> +{
>> +	/* Records larger than max_rec_len were skipped earlier. */
>> +	size_t len = min(max_rec_len, old->hdr.record_length);
>> +
>> +	memcpy(new, old, len);
>> +}
> 
> Fold into the single call site.
>

Ack.
  
>> +
>> +static bool valid_record(struct fru_rec *old)
>> +{
>> +	struct fru_rec *new;
>> +
>> +	if (!is_valid_fmp(old)) {
>> +		pr_debug("Ignoring invalid record");
>> +		return false;
>> +	}
>> +
>> +	new = get_fru_record(old->fmp.fru_id);
>> +	if (!new) {
>> +		pr_debug("Ignoring record for absent FRU");
>> +		return false;
>> +	}
>> +
>> +	/* What if ERST has duplicate FRU entries? */
>> +	restore_record(new, old);
>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Fetch saved records from persistent storage.
>> + *
>> + * For each found record:
>> + * - If it was not created by this module, then ignore it.
>> + * - If it is valid, then copy its data to the local cache.
>> + * - If it is not valid, then erase it.
>> + */
>> +static int get_saved_records(void)
>> +{
>> +	struct fru_rec *old;
>> +	u64 record_id;
>> +	int ret, pos;
>> +	ssize_t len;
>> +
>> +	/*
>> +	 * Assume saved records match current max size.
>> +	 *
>> +	 * However, this may not be true depending on module parameters.
> 
> This must work with module parameters, though. Or, as said and
> preferrably, there should not be any module parameters at all.
>

I agree though I'm not sure which course to take.

Generally, I think the code is *safe* as-is. But it isn't ideal.

If the module expects records with 8 entries, and storage has records
with 4, then we can still get the data. And the "fix up" question applies
from above.

If the module expects records with 4 entries, and storage has records
with 8, then the module will ignore the stored records. This isn't
ideal as the module misses out on possible valid information.
  
>> +	 */
>> +	old = kmalloc(max_rec_len, GFP_KERNEL);
>> +	if (!old) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	ret = erst_get_record_id_begin(&pos);
>> +	if (ret < 0)
>> +		goto out_end;
>> +
>> +	while (!erst_get_record_id_next(&pos, &record_id)) {
>> +		/*
>> +		 * Make sure to clear temporary buffer between reads to avoid
>> +		 * leftover data from records of various sizes.
>> +		 */
>> +		memset(old, 0, max_rec_len);
>> +
>> +		len = erst_read_record(record_id, &old->hdr, max_rec_len,
>> +				       sizeof(struct fru_rec), &CPER_CREATOR_FMP);
>> +
>> +		/* Should this be retried if the temporary buffer is too small? */
> 
> Only when it turns out that it is necessary.
>

Right, but how do we know? I think it is necessary given what I just wrote above.
We don't want to miss out on valid info.

>> +		if (len < 0)
>> +			continue;
>> +
>> +		if (!valid_record(old))
>> +			erst_clear(record_id);
> 
> Where is the check which ignores the record not created by this module?
> 
> Because this clears all records it deems not valid and that thing needs
> to be really careful here and be sure what exactly it clears...
>

erst_read_record() only returns records with the given creator_id. If the
record doesn't have a matching creator_id, then we'll get a (len < 0).

Thanks,
Yazen
  
Yazen Ghannam Feb. 14, 2024, 3:33 p.m. UTC | #20
On 2/14/2024 7:02 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +static unsigned int get_cpu_for_fru_num(unsigned int i)
>> +{
>> +	unsigned int cpu = 0;
>> +
>> +	/* Should there be more robust error handling if none found? */
>> +	for_each_online_cpu(cpu) {
> 
> You need to block CPU hotplug operations before iterating over online
> CPUs: cpus_read_lock/unlock.
>

Okay.
  
>> +		if (topology_physical_package_id(cpu) == i)
>> +			return cpu;
>> +	}
>> +
>> +	return cpu;
>> +}
> 
> Fold this one into its single callsite.
>

Ack.
  
>> +
>> +static void init_fmps(void)
>> +{
>> +	struct fru_rec *rec;
>> +	unsigned int i, cpu;
>> +
>> +	for_each_fru(i, rec) {
>> +		cpu = get_cpu_for_fru_num(i);
>> +		set_fmp_fields(get_fmp(rec), cpu);
>> +	}
>> +}
>> +
>> +static int get_system_info(void)
>> +{
>> +	u8 model = boot_cpu_data.x86_model;
> 
> No need for that local var - just use boot_cpu_data.x86_model.
>

Ack.
  
>> +	/* Only load on MI300A systems for now. */
>> +	if (!(model >= 0x90 && model <= 0x9f))
>> +		return -ENODEV;
>> +
>> +	if (!cpu_feature_enabled(X86_FEATURE_AMD_PPIN)) {
>> +		pr_debug("PPIN feature not available");
>> +		return -ENODEV;
>> +	}
>> +
>> +	/* Use CPU Package (Socket) as FRU for MI300 systems. */
>> +	max_nr_fru = topology_max_packages();
>> +	if (!max_nr_fru)
>> +		return -ENODEV;
>> +
>> +	if (!max_nr_entries)
>> +		max_nr_entries = FMPM_DEFAULT_MAX_NR_ENTRIES;
>> +
>> +	max_rec_len  = sizeof(struct fru_rec);
>> +	max_rec_len += sizeof(struct cper_fru_poison_desc) * max_nr_entries;
>> +
>> +	pr_debug("max_nr_fru=%u max_nr_entries=%u, max_rec_len=%lu",
>> +		 max_nr_fru, max_nr_entries, max_rec_len);
>> +	return 0;
>> +}
>> +
>> +static void deallocate_records(void)
> 
> free_records
>

Ack.
  
>> +{
>> +	struct fru_rec *rec;
>> +	int i;
>> +
>> +	for_each_fru(i, rec)
>> +		kfree(rec);
>> +
>> +	kfree(fru_records);
>> +}
>> +
>> +static int allocate_records(void)
>> +{
>> +	int i, ret = 0;
>> +
>> +	fru_records = kcalloc(max_nr_fru, sizeof(struct fru_rec *), GFP_KERNEL);
>> +	if (!fru_records) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	for (i = 0; i < max_nr_fru; i++) {
>> +		fru_records[i] = kzalloc(max_rec_len, GFP_KERNEL);
>> +		if (!fru_records[i]) {
>> +			ret = -ENOMEM;
>> +			goto out_free;
>> +		}
>> +	}
>> +
>> +	return ret;
>> +
>> +out_free:
>> +	for (; i >= 0; i--)
>> +		kfree(fru_records[i]);
>> +
>> +	kfree(fru_records);
>> +out:
>> +	return ret;
>> +}
>> +
>> +static const struct x86_cpu_id fmpm_cpuids[] = {
>> +	X86_MATCH_VENDOR_FAM(AMD, 0x19, NULL),
> 
> This is why this should depend on AMD in Kconfig.
> 

Okay.

I was also thinking that MODULE_DEVICE_TABLE shouldn't be used. Not all
MI300-based systems will need or can use this module. And it does depend
on specific platform configurations.

So the module should not autoload. Users will need to manually load it if
they know that it's usable on their platform. We can keep the cpuid[] and
model checks just for extra safety.

Thanks,
Yazen
  
Borislav Petkov Feb. 14, 2024, 3:49 p.m. UTC | #21
On Wed, Feb 14, 2024 at 09:21:45AM -0500, Yazen Ghannam wrote:
> Do you mean this should be left out of the commit message?

Yes, the text should talk only about what the patch does. What can and
will and won't happen in the future doesn't matter.

IOW, here's what I have now:

RAS: Introduce a FRU memory poison manager

Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations are acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in mm through page poisoning, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time
error handling may occur in the next reboot cycle, leading to
terminating jobs due to that already known bad memory. This could be
prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver saves memory error information to the persistent
storage during run time. The information is then be restored after
reset, and known bad memory will be retired before the hardware is used.
A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should
leverage industry-standard components as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

  [ bp: Massage commit message. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
  
Yazen Ghannam Feb. 14, 2024, 4:01 p.m. UTC | #22
On 2/14/2024 10:49 AM, Borislav Petkov wrote:
> On Wed, Feb 14, 2024 at 09:21:45AM -0500, Yazen Ghannam wrote:
>> Do you mean this should be left out of the commit message?
> 
> Yes, the text should talk only about what the patch does. What can and
> will and won't happen in the future doesn't matter.
>

Got it.
  
> IOW, here's what I have now:
> 
> RAS: Introduce a FRU memory poison manager
> 
> Memory errors are an expected occurrence on systems with high memory
> density. Generally, errors within a small number of unique physical
> locations are acceptable, based on manufacturer and/or admin policy.
> During run time, memory with errors may be retired so it is no longer
> used by the system. This is done in mm through page poisoning, and the
> effect will remain until the system is restarted.
> 
> If a memory location is consistently faulty, then the same run time
> error handling may occur in the next reboot cycle, leading to
> terminating jobs due to that already known bad memory. This could be
> prevented if information from the previous boot was not lost.
> 
> Some add-in cards with driver-managed memory have on-board persistent
> storage. Their driver saves memory error information to the persistent
> storage during run time. The information is then be restored after

"then be" -> "then"

> reset, and known bad memory will be retired before the hardware is used.
> A running log of bad memory locations is kept across multiple resets.
> 
> A similar solution is desirable for CPUs. However, this solution should
> leverage industry-standard components as much as possible, rather than
> a bespoke platform driver.
> 
> Two components are needed: a record format and a persistent storage
> interface.
> 
> Implement a new module to manage the record formats on persistent
> storage. Use the requirements for an AMD MI300-based system to start.
> Vendor- and platform-specific details can be abstracted later as needed.
> 
>    [ bp: Massage commit message. ]
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
> 

Otherwise, looks good.

Thanks,
Yazen
  
Borislav Petkov Feb. 14, 2024, 4:11 p.m. UTC | #23
On Wed, Feb 14, 2024 at 11:01:36AM -0500, Yazen Ghannam wrote:
> "then be" -> "then"

Fixed.

> Otherwise, looks good.

Thx.
  
Borislav Petkov Feb. 14, 2024, 5:50 p.m. UTC | #24
On Wed, Feb 14, 2024 at 09:28:54AM -0500, Yazen Ghannam wrote:
> > That's a good thing to have here.

Up to here. __packed still needs clarification.

diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index 782951aa302f..f5dde88a3188 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -37,14 +37,13 @@ source "drivers/ras/amd/atl/Kconfig"
 config RAS_FMPM
 	tristate "FRU Memory Poison Manager"
 	default m
-	depends on X86_MCE
-	imply AMD_ATL
+	depends on AMD_ATL
 	help
 	  Support saving and restoring memory error information across reboot
-	  cycles using ACPI ERST as persistent storage. Error information is
-	  saved with the UEFI CPER "FRU Memory Poison" section format.
+	  using ACPI ERST as persistent storage. Error information is saved with
+	  the UEFI CPER "FRU Memory Poison" section format.
 
-	  Memory may be retired during boot time and run time depending on
+	  Memory will be retired during boot time and run time depending on
 	  platform-specific policies.
 
 endif
diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index d6a963aca093..901a1f0018fc 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -12,7 +12,7 @@
  *
  * Implementation notes, assumptions, and limitations:
  *
- * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
+ * - FRU memory poison section and memory poison descriptor definitions are not yet
  *   included in the UEFI specification. So they are defined here. Afterwards, they
  *   may be moved to linux/cper.h, if appropriate.
  *
@@ -23,16 +23,13 @@
  *   AMD MI300-based platform(s) assumptions:
  *   - Memory errors are reported through x86 MCA.
  *   - The entire DRAM row containing a memory error should be retired.
- *   - There will be (1) FRU Memory Poison Section per CPER.
- *   - The FRU will be the CPU Package (Processor Socket).
- *   - The default number of Memory Poison Descriptor entries should be (8).
- *   - The Platform will use ACPI ERST for persistent storage.
+ *   - There will be (1) FRU memory poison section per CPER.
+ *   - The FRU will be the CPU package (processor socket).
+ *   - The default number of memory poison descriptor entries should be (8).
+ *   - The platform will use ACPI ERST for persistent storage.
  *   - All FRU records should be saved to persistent storage. Module init will
  *     fail if any FRU record is not successfully written.
  *
- * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
- *   to use this module for other vendors.
- *
  * - Boot time memory retirement may occur later than ideal due to dependencies
  *   on other libraries and drivers. This leaves a gap where bad memory may be
  *   accessed during early boot stages.
  
Borislav Petkov Feb. 14, 2024, 6:04 p.m. UTC | #25
On Wed, Feb 14, 2024 at 09:45:14AM -0500, Yazen Ghannam wrote:
> Yes, we keep a local copy of the records within the module. That way
> we just need to update the local copy and write it down to the
> platform. This saves time and avoids interrupting the platform to do
> an extra read.

I guess this:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 901a1f0018fc..99499a37e9d5 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -142,8 +142,9 @@ static unsigned int max_nr_fru;
 static size_t max_rec_len;
 
 /*
- * Protect the local cache and prevent concurrent writes to storage.
- * This is only needed after init once notifier block registration is done.
+ * Protect the local records cache in fru_records and prevent concurrent
+ * writes to storage. This is only needed after init once notifier block
+ * registration is done.
  */
 static DEFINE_MUTEX(fmpm_update_mutex);
 
I'm still don't like the module param...
  
Borislav Petkov Feb. 14, 2024, 6:21 p.m. UTC | #26
On Wed, Feb 14, 2024 at 06:50:35PM +0100, Borislav Petkov wrote:
> On Wed, Feb 14, 2024 at 09:28:54AM -0500, Yazen Ghannam wrote:
> > > That's a good thing to have here.
> 
> Up to here. __packed still needs clarification.

Yap, that is plain old __packed, as we just established:

---

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 99499a37e9d5..a67a4b67cf9d 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -53,8 +53,6 @@
 #include <asm/cpu_device_id.h>
 #include <asm/mce.h>
 
-#pragma pack(1)
-
 /* Validation Bits */
 #define FMP_VALID_ARCH_TYPE		BIT_ULL(0)
 #define FMP_VALID_ARCH			BIT_ULL(1)
@@ -78,7 +76,7 @@ struct cper_sec_fru_mem_poison {
 	u32 fru_id_type;
 	u64 fru_id;
 	u32 nr_entries;
-};
+} __packed;
 
 /* FRU Descriptor ID Types */
 #define FPD_HW_ID_TYPE_MCA_IPID		0
@@ -93,7 +91,7 @@ struct cper_fru_poison_desc {
 	u64 hw_id;
 	u32 addr_type;
 	u64 addr;
-};
+} __packed;
 
 /* Collection of headers and sections for easy pointer use. */
 struct fru_rec {
@@ -101,10 +99,7 @@ struct fru_rec {
 	struct cper_section_descriptor	sec_desc;
 	struct cper_sec_fru_mem_poison	fmp;
 	struct cper_fru_poison_desc	entries[];
-};
-
-/* Reset to default packing */
-#pragma pack()
+} __packed;
 
 /*
  * Pointers to the complete CPER record of each FRU.
  
Borislav Petkov Feb. 14, 2024, 6:29 p.m. UTC | #27
On Wed, Feb 14, 2024 at 09:46:46AM -0500, Yazen Ghannam wrote:
> > This is a generic thing and thus can't use an x86-ism struct mce,
> > remember?
> > 
> 
> Yep, that's one of the assumptions/limitations I highlighted.

Right, since the notifier callback fru_mem_poison_nb is on the x86 MCE
chain and since we're making this thing depend on AMD_ATL which depends
on X86_64 currently, it is ok to have struct mce in there.

Once something else outside of x86 wants to use it, it'll have to be
decoupled and use atl_err.

Later...
  
Borislav Petkov Feb. 14, 2024, 6:44 p.m. UTC | #28
On Wed, Feb 14, 2024 at 09:56:14AM -0500, Yazen Ghannam wrote:
> > This one needs to go too.
> > 
> 
> Ack.

Gone:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index a67a4b67cf9d..643c36b6dc9c 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -146,11 +146,6 @@ static DEFINE_MUTEX(fmpm_update_mutex);
 #define for_each_fru(i, rec) \
 	for (i = 0; rec = fru_records[i], i < max_nr_fru; i++)
 
-static inline struct cper_fru_poison_desc *get_fpd(struct fru_rec *rec, u32 entry)
-{
-	return &rec->entries[entry];
-}
-
 static inline u32 get_fmp_len(struct fru_rec *rec)
 {
 	return rec->sec_desc.section_length - sizeof(struct cper_section_descriptor);
@@ -253,7 +248,9 @@ static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
 	unsigned int i;
 
 	for (i = 0; i < rec->fmp.nr_entries; i++) {
-		if (same_fpd(get_fpd(rec, i), fpd)) {
+		struct cper_fru_poison_desc *fpd_i = &rec->entries[i];
+
+		if (same_fpd(fpd_i, fpd)) {
 			pr_debug("Found duplicate record");
 			return true;
 		}
@@ -265,7 +262,7 @@ static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
 static void update_fru_record(struct fru_rec *rec, struct mce *m)
 {
 	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
-	struct cper_fru_poison_desc fpd;
+	struct cper_fru_poison_desc fpd, *fpd_dest;
 	u32 entry = 0;
 
 	mutex_lock(&fmpm_update_mutex);
@@ -287,9 +284,10 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 		goto out_unlock;
 	}
 
-	entry = fmp->nr_entries;
+	entry	  = fmp->nr_entries;
+	fpd_dest  = &rec->entries[entry];
 
-	memcpy(get_fpd(rec, entry), &fpd, sizeof(struct cper_fru_poison_desc));
+	memcpy(fpd_dest, &fpd, sizeof(struct cper_fru_poison_desc));
 
 	fmp->nr_entries		 = entry + 1;
 	fmp->validation_bits	|= FMP_VALID_LIST_ENTRIES;
@@ -359,11 +357,10 @@ static u32 get_cpu_from_fru_id(u64 fru_id)
 
 static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
 {
-	struct cper_fru_poison_desc *fpd;
 	unsigned int i;
 
 	for (i = 0; i < nr_entries; i++) {
-		fpd = get_fpd(rec, i);
+		struct cper_fru_poison_desc *fpd = &rec->entries[i];
 
 		if (fpd->hw_id_type != FPD_HW_ID_TYPE_MCA_IPID)
 			continue;


> >          /* Use the complement value. */
> >          rec->fmp.checksum = -checksum;
> > 
> > I'd say.
> > 
> 
> This was my first thought. Other checksum code in the kernel does
> the (0-X) thing. So I wasn't sure if there's any odd side effects
> of one over the other. And I didn't take the time to dig into it.

I guess to probably be more expressive? I don't see how

0 - X

and 

-X

differ.

And you can always do a before-after and look at the asm:

before:
# drivers/ras/amd/fmpm.c:202:   rec->fmp.checksum = 0 - checksum;
#NO_APP
        subl    %edx, %eax      # checksum, tmp100
        movl    %eax, 200(%rbx) # tmp100, rec_9(D)->fmp.checksum

after:
# drivers/ras/amd/fmpm.c:202:   rec->fmp.checksum = -checksum;
#NO_APP
        subl    %edx, %eax      # checksum, tmp100
        movl    %eax, 200(%rbx) # tmp100, rec_9(D)->fmp.checksum


> > -/* Calculate a new checksum. */
> > -static u32 get_fmp_checksum(struct fru_rec *rec)
> 
> I made this a helper because we need to validate the checksum when
> reading records from storage too.

It has a single user that's why I whacked it. If a new one materializes,
sure, you can carve it out.

Thx.
  
Borislav Petkov Feb. 14, 2024, 6:47 p.m. UTC | #29
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
> +static bool same_fpd(struct cper_fru_poison_desc *old, struct cper_fru_poison_desc *new)

The usual convention in the kernel is:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 643c36b6dc9c..e50f11fb90a4 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -220,7 +220,7 @@ static bool rec_has_valid_entries(struct fru_rec *rec)
 	return true;
 }
 
-static bool same_fpd(struct cper_fru_poison_desc *old, struct cper_fru_poison_desc *new)
+static bool fpds_equal(struct cper_fru_poison_desc *old, struct cper_fru_poison_desc *new)
 {
 	/*
 	 * Ignore timestamp field.
@@ -250,7 +250,7 @@ static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
 	for (i = 0; i < rec->fmp.nr_entries; i++) {
 		struct cper_fru_poison_desc *fpd_i = &rec->entries[i];
 
-		if (same_fpd(fpd_i, fpd)) {
+		if (fpds_equal(fpd_i, fpd)) {
 			pr_debug("Found duplicate record");
 			return true;
 		}
  
Borislav Petkov Feb. 14, 2024, 7:44 p.m. UTC | #30
On Wed, Feb 14, 2024 at 10:26:41AM -0500, Yazen Ghannam wrote:
> > > +	/* This is the first entry, so just save it. */
> > > +	if (!has_valid_entries(fmp->validation_bits))
> > > +		goto save_fpd;
> > 
> > Not needed - if it is the first entry, it'll get saved there.
> > 
> 
> Get saved where?
> 
> For brand new records, the module will allocate them with the headers
> and no descriptor entries (empty list).

As discussed offlist, let's stick to checking validation_bits even if
->nr_entries should be 0 when rec doesn't have valid entries yet.

So lemme readd it.

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index e50f11fb90a4..daab7f58505a 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -275,6 +275,10 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 	fpd.addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
 	fpd.addr	= m->addr;
 
+	/* This is the first entry, so just save it. */
+	if (!rec_has_valid_entries(rec))
+		goto save_fpd;
+
 	/* Ignore already recorded errors. */
 	if (rec_has_fpd(rec, &fpd))
 		goto out_unlock;
@@ -287,6 +291,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 	entry	  = fmp->nr_entries;
 	fpd_dest  = &rec->entries[entry];
 
+save_fpd:
 	memcpy(fpd_dest, &fpd, sizeof(struct cper_fru_poison_desc));
 
 	fmp->nr_entries		 = entry + 1;


> > Yap, exactly, this should use atl_err and not struct mce.
> > 
> 
> Yes, tried to do *some* things generic.

Right.

> > > +	 * This should not happen on real errors. But it could happen from
> > 
> > What exactly is "This" here?
> > 
> 
> Ah right. The module should have created, or restored, a record for each FRU
> in the system during module init. So the runtime handler should always find
> a valid record for a FRU. The only exception I could think of, besides bugs,
> is if the user does software error injection and a valid FRU ID doesn't get
> set.

Changed to:

        /*
         * An invalid FRU ID should not happen on real errors. But it
         * could happen from software error injection, etc.
         */
        rec = get_fru_record(m->ppin);

> > Pass in that fmp thing into retire_dram_row() so that you can delay
> > that get_cpu_from_fru_id() call until the moment you actually need it.

IOW, this:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index a314598186a0..c2dc83a4e82a 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -47,6 +47,7 @@
 
 #include <linux/cper.h>
 #include <linux/ras.h>
+#include <linux/cpu.h>
 
 #include <acpi/apei.h>
 
@@ -347,21 +348,10 @@ static struct notifier_block fru_mem_poison_nb = {
 	.priority	= MCE_PRIO_LOWEST,
 };
 
-static u32 get_cpu_from_fru_id(u64 fru_id)
-{
-	unsigned int cpu = 0;
-
-	/* Should there be more robust error handling if none found? */
-	for_each_online_cpu(cpu) {
-		if (topology_ppin(cpu) == fru_id)
-			break;
-	}
-
-	return cpu;
-}
-
-static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
+static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries)
 {
+	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
+	unsigned int cpu, err_cpu = -1;
 	unsigned int i;
 
 	for (i = 0; i < nr_entries; i++) {
@@ -373,7 +363,16 @@ static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
 		if (fpd->addr_type != FPD_ADDR_TYPE_MCA_ADDR)
 			continue;
 
-		retire_dram_row(fpd->addr, fpd->hw_id, cpu);
+		cpus_read_lock();
+		for_each_online_cpu(cpu) {
+			if (topology_ppin(cpu) == fmp->fru_id) {
+				err_cpu = cpu;
+				break;
+			}
+		}
+		cpus_read_unlock();
+
+		retire_dram_row(fpd->addr, fpd->hw_id, err_cpu);
 	}
 }
 
@@ -382,7 +381,6 @@ static void retire_mem_records(void)
 	struct cper_sec_fru_mem_poison *fmp;
 	struct fru_rec *rec;
 	unsigned int i;
-	u32 cpu;
 
 	for_each_fru(i, rec) {
 		fmp = &rec->fmp;
@@ -390,9 +388,7 @@ static void retire_mem_records(void)
 		if (!rec_has_valid_entries(rec))
 			continue;
 
-		cpu = get_cpu_from_fru_id(fmp->fru_id);
-
-		retire_mem_fmp(rec, fmp->nr_entries, cpu);
+		retire_mem_fmp(rec, fmp->nr_entries);
 	}
 }
  
Borislav Petkov Feb. 14, 2024, 8:10 p.m. UTC | #31
On Wed, Feb 14, 2024 at 10:26:41AM -0500, Yazen Ghannam wrote:
> > > +static bool is_valid_fmp(struct fru_rec *rec)
> > 
> > fmp_is_valid()
> > 
> 
> Ack.

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index bcee828cb916..3600bf0dca53 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -447,7 +447,7 @@ static int save_new_records(void)
 	return ret;
 }
 
-static bool is_valid_fmp(struct fru_rec *rec)
+static bool fmp_is_valid(struct fru_rec *rec)
 {
 	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
 	u32 len = get_fmp_len(rec);
@@ -486,19 +486,12 @@ static bool is_valid_fmp(struct fru_rec *rec)
 	return true;
 }
 
-static void restore_record(struct fru_rec *new, struct fru_rec *old)
-{
-	/* Records larger than max_rec_len were skipped earlier. */
-	size_t len = min(max_rec_len, old->hdr.record_length);
-
-	memcpy(new, old, len);
-}
-
 static bool valid_record(struct fru_rec *old)
 {
 	struct fru_rec *new;
+	size_t len;
 
-	if (!is_valid_fmp(old)) {
+	if (!fmp_is_valid(old)) {
 		pr_debug("Ignoring invalid record");
 		return false;
 	}
@@ -509,8 +502,11 @@ static bool valid_record(struct fru_rec *old)
 		return false;
 	}
 
-	/* What if ERST has duplicate FRU entries? */
-	restore_record(new, old);
+	/* Records larger than max_rec_len were skipped earlier. */
+	len = min(max_rec_len, old->hdr.record_length);
+
+	/* Restore the record */
+	memcpy(new, old, len);
 
 	return true;
 }
  
Borislav Petkov Feb. 14, 2024, 8:18 p.m. UTC | #32
On Wed, Feb 14, 2024 at 10:33:15AM -0500, Yazen Ghannam wrote:
> I was also thinking that MODULE_DEVICE_TABLE shouldn't be used. Not all
> MI300-based systems will need or can use this module. And it does depend
> on specific platform configurations.
> 
> So the module should not autoload. Users will need to manually load it if
> they know that it's usable on their platform. We can keep the cpuid[] and
> model checks just for extra safety.

Ok, makes sense.

The above converted:

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index bcee828cb916..6b280cf503a4 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -447,7 +447,7 @@ static int save_new_records(void)
 	return ret;
 }
 
-static bool is_valid_fmp(struct fru_rec *rec)
+static bool fmp_is_valid(struct fru_rec *rec)
 {
 	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
 	u32 len = get_fmp_len(rec);
@@ -486,19 +486,12 @@ static bool is_valid_fmp(struct fru_rec *rec)
 	return true;
 }
 
-static void restore_record(struct fru_rec *new, struct fru_rec *old)
-{
-	/* Records larger than max_rec_len were skipped earlier. */
-	size_t len = min(max_rec_len, old->hdr.record_length);
-
-	memcpy(new, old, len);
-}
-
 static bool valid_record(struct fru_rec *old)
 {
 	struct fru_rec *new;
+	size_t len;
 
-	if (!is_valid_fmp(old)) {
+	if (!fmp_is_valid(old)) {
 		pr_debug("Ignoring invalid record");
 		return false;
 	}
@@ -509,8 +502,11 @@ static bool valid_record(struct fru_rec *old)
 		return false;
 	}
 
-	/* What if ERST has duplicate FRU entries? */
-	restore_record(new, old);
+	/* Records larger than max_rec_len were skipped earlier. */
+	len = min(max_rec_len, old->hdr.record_length);
+
+	/* Restore the record */
+	memcpy(new, old, len);
 
 	return true;
 }
@@ -588,36 +584,35 @@ static void set_fmp_fields(struct fru_rec *rec, unsigned int cpu)
 	fmp->validation_bits |= FMP_VALID_ID;
 }
 
-static unsigned int get_cpu_for_fru_num(unsigned int i)
-{
-	unsigned int cpu = 0;
-
-	/* Should there be more robust error handling if none found? */
-	for_each_online_cpu(cpu) {
-		if (topology_physical_package_id(cpu) == i)
-			return cpu;
-	}
-
-	return cpu;
-}
-
 static void init_fmps(void)
 {
 	struct fru_rec *rec;
 	unsigned int i, cpu;
 
+	cpus_read_lock();
 	for_each_fru(i, rec) {
-		cpu = get_cpu_for_fru_num(i);
-		set_fmp_fields(rec, cpu);
+		int fru_cpu = -1;
+
+		for_each_online_cpu(cpu) {
+			if (topology_physical_package_id(cpu) == i) {
+				fru_cpu = i;
+				break;
+			}
+		}
+
+		if (fru_cpu < 0)
+			continue;
+
+		set_fmp_fields(rec, fru_cpu);
 	}
+	cpus_read_unlock();
 }
 
 static int get_system_info(void)
 {
-	u8 model = boot_cpu_data.x86_model;
-
 	/* Only load on MI300A systems for now. */
-	if (!(model >= 0x90 && model <= 0x9f))
+	if (!(boot_cpu_data.x86_model >= 0x90 &&
+	      boot_cpu_data.x86_model <= 0x9f))
 		return -ENODEV;
 
 	if (!cpu_feature_enabled(X86_FEATURE_AMD_PPIN)) {
@@ -641,7 +636,7 @@ static int get_system_info(void)
 	return 0;
 }
 
-static void deallocate_records(void)
+static void free_records(void)
 {
 	struct fru_rec *rec;
 	int i;
@@ -728,7 +723,7 @@ static int __init fru_mem_poison_init(void)
 	return 0;
 
 out_free:
-	deallocate_records();
+	free_records();
 out:
 	return ret;
 }
@@ -736,7 +731,7 @@ static int __init fru_mem_poison_init(void)
 static void __exit fru_mem_poison_exit(void)
 {
 	mce_unregister_decode_chain(&fru_mem_poison_nb);
-	deallocate_records();
+	free_records();
 }
 
 module_init(fru_mem_poison_init);
  

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index fc5996feba70..8541cc69c43b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18363,6 +18363,13 @@  F:	drivers/ras/
 F:	include/linux/ras.h
 F:	include/ras/ras_event.h
 
+RAS FRU MEMORY POISON MANAGER (FMPM)
+M:	Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
+M:	Yazen Ghannam <Yazen.Ghannam@amd.com>
+L:	linux-edac@vger.kernel.org
+S:	Maintained
+F:	drivers/ras/amd/fmpm.c
+
 RC-CORE / LIRC FRAMEWORK
 M:	Sean Young <sean@mess.org>
 L:	linux-media@vger.kernel.org
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index 2e969f59c0ca..782951aa302f 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -34,4 +34,17 @@  if RAS
 source "arch/x86/ras/Kconfig"
 source "drivers/ras/amd/atl/Kconfig"
 
+config RAS_FMPM
+	tristate "FRU Memory Poison Manager"
+	default m
+	depends on X86_MCE
+	imply AMD_ATL
+	help
+	  Support saving and restoring memory error information across reboot
+	  cycles using ACPI ERST as persistent storage. Error information is
+	  saved with the UEFI CPER "FRU Memory Poison" section format.
+
+	  Memory may be retired during boot time and run time depending on
+	  platform-specific policies.
+
 endif
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 3fac80f58005..11f95d59d397 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -3,4 +3,5 @@  obj-$(CONFIG_RAS)	+= ras.o
 obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_RAS_CEC)	+= cec.o
 
+obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
 obj-y			+= amd/atl/
diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
new file mode 100644
index 000000000000..077d9f35cc7d
--- /dev/null
+++ b/drivers/ras/amd/fmpm.c
@@ -0,0 +1,776 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * FRU (Field-Replaceable Unit) Memory Poison Manager
+ *
+ * Copyright (c) 2024, Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Authors:
+ *	Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
+ *	Muralidhara M K <muralidhara.mk@amd.com>
+ *	Yazen Ghannam <Yazen.Ghannam@amd.com>
+ *
+ * Implementation notes, assumptions, and limitations:
+ *
+ * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
+ *   included in the UEFI specification. So they are defined here. Afterwards, they
+ *   may be moved to linux/cper.h, if appropriate.
+ *
+ * - Platforms based on AMD MI300 systems will be the first to use these structures.
+ *   There are a number of assumptions made here that will need to be generalized
+ *   to support other platforms.
+ *
+ *   AMD MI300-based platform(s) assumptions:
+ *   - Memory errors are reported through x86 MCA.
+ *   - The entire DRAM row containing a memory error should be retired.
+ *   - There will be (1) FRU Memory Poison Section per CPER.
+ *   - The FRU will be the CPU Package (Processor Socket).
+ *   - The default number of Memory Poison Descriptor entries should be (8).
+ *   - The Platform will use ACPI ERST for persistent storage.
+ *   - All FRU records should be saved to persistent storage. Module init will
+ *     fail if any FRU record is not successfully written.
+ *
+ * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
+ *   to use this module for other vendors.
+ *
+ * - Boot time memory retirement may occur later than ideal due to dependencies
+ *   on other libraries and drivers. This leaves a gap where bad memory may be
+ *   accessed during early boot stages.
+ *
+ * - Enough memory should be pre-allocated for each FRU record to be able to hold
+ *   the expected number of descriptor entries. This, mostly empty, record is
+ *   written to storage during init time. Subsequent writes to the same record
+ *   should allow the Platform to update the stored record in-place. Otherwise,
+ *   if the record is extended, then the Platform may need to perform costly memory
+ *   management operations on the storage. For example, the Platform may spend time
+ *   in Firmware copying and invalidating memory on a relatively slow SPI ROM.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cper.h>
+#include <linux/ras.h>
+
+#include <acpi/apei.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/mce.h>
+
+#pragma pack(1)
+
+/* Validation Bits */
+#define FMP_VALID_ARCH_TYPE		BIT_ULL(0)
+#define FMP_VALID_ARCH			BIT_ULL(1)
+#define FMP_VALID_ID_TYPE		BIT_ULL(2)
+#define FMP_VALID_ID			BIT_ULL(3)
+#define FMP_VALID_LIST_ENTRIES		BIT_ULL(4)
+#define FMP_VALID_LIST			BIT_ULL(5)
+
+/* FRU Architecture Types */
+#define FMP_ARCH_TYPE_X86_CPUID_1_EAX	0
+
+/* FRU ID Types */
+#define FMP_ID_TYPE_X86_PPIN		0
+
+/* FRU Memory Poison Section, UEFI vX.Y sec N.X.Z */
+struct cper_sec_fru_mem_poison {
+	u32 checksum;
+	u64 validation_bits;
+	u32 fru_arch_type;
+	u64 fru_arch;
+	u32 fru_id_type;
+	u64 fru_id;
+	u32 nr_entries;
+};
+
+/* FRU Descriptor ID Types */
+#define FPD_HW_ID_TYPE_MCA_IPID		0
+
+/* FRU Descriptor Address Types */
+#define FPD_ADDR_TYPE_MCA_ADDR		0
+
+/* Memory Poison Descriptor, UEFI vX.Y sec N.X.Y */
+struct cper_fru_poison_desc {
+	u64 timestamp;
+	u32 hw_id_type;
+	u64 hw_id;
+	u32 addr_type;
+	u64 addr;
+};
+
+/* Collection of headers and sections for easy pointer use. */
+struct fru_rec {
+	struct cper_record_header	hdr;
+	struct cper_section_descriptor	sec_desc;
+	struct cper_sec_fru_mem_poison	fmp;
+	struct cper_fru_poison_desc	entries[];
+};
+
+/* Reset to default packing */
+#pragma pack()
+
+/*
+ * Pointers to the complete CPER record of each FRU.
+ *
+ * Memory allocation will include padded space for descriptor entries.
+ */
+static struct fru_rec **fru_records;
+
+#define CPER_CREATOR_FMP						\
+	GUID_INIT(0xcd5c2993, 0xf4b2, 0x41b2, 0xb5, 0xd4, 0xf9, 0xc3,	\
+		  0xa0, 0x33, 0x08, 0x75)
+
+#define CPER_SECTION_TYPE_FMP						\
+	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2,	\
+		  0x12, 0x0a, 0x44, 0x58)
+
+/**
+ * DOC: fru_poison_entries (byte)
+ * Maximum number of descriptor entries possible for each FRU.
+ *
+ * Values between '1' and '255' are valid.
+ * No input or '0' will default to FMPM_DEFAULT_MAX_NR_ENTRIES.
+ */
+static u8 max_nr_entries;
+module_param(max_nr_entries, byte, 0644);
+MODULE_PARM_DESC(max_nr_entries,
+		 "Maximum number of memory poison descriptor entries per FRU");
+
+#define FMPM_DEFAULT_MAX_NR_ENTRIES	8
+
+/* Maximum number of FRUs in the system. */
+static unsigned int max_nr_fru;
+
+/* Total length of record including headers and list of descriptor entries. */
+static size_t max_rec_len;
+
+/*
+ * Protect the local cache and prevent concurrent writes to storage.
+ * This is only needed after init once notifier block registration is done.
+ */
+static DEFINE_MUTEX(fmpm_update_mutex);
+
+#define for_each_fru(i, rec) \
+	for (i = 0; rec = fru_records[i], i < max_nr_fru; i++)
+
+static inline struct cper_sec_fru_mem_poison *get_fmp(struct fru_rec *rec)
+{
+	return &rec->fmp;
+}
+
+static inline struct cper_fru_poison_desc *get_fpd(struct fru_rec *rec, u32 entry)
+{
+	return &rec->entries[entry];
+}
+
+static inline u32 get_fmp_len(struct fru_rec *rec)
+{
+	return rec->sec_desc.section_length - sizeof(struct cper_section_descriptor);
+}
+
+static struct fru_rec *get_fru_record(u64 fru_id)
+{
+	struct fru_rec *rec;
+	unsigned int i;
+
+	for_each_fru(i, rec) {
+		if (get_fmp(rec)->fru_id == fru_id)
+			return rec;
+	}
+
+	pr_debug("Record not found for FRU 0x%016llx", fru_id);
+	return NULL;
+}
+
+/*
+ * Sum up all bytes within the FRU Memory Poison Section including the Memory
+ * Poison Descriptor entries.
+ */
+static u32 do_fmp_checksum(struct cper_sec_fru_mem_poison *fmp, u32 len)
+{
+	u32 checksum = 0;
+	u8 *buf, *end;
+
+	buf = (u8 *)fmp;
+	end = buf + len;
+
+	while (buf < end)
+		checksum += (u8)(*(buf++));
+
+	return checksum;
+}
+
+/* Calculate a new checksum. */
+static u32 get_fmp_checksum(struct fru_rec *rec)
+{
+	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
+	u32 len, checksum;
+
+	len = get_fmp_len(rec);
+
+	/* Get the current total. */
+	checksum = do_fmp_checksum(fmp, len);
+
+	/* Subtract the current checksum from total. */
+	checksum -= fmp->checksum;
+
+	/* Return the compliment value. */
+	return 0 - checksum;
+}
+
+static int update_record_on_storage(struct fru_rec *rec)
+{
+	int ret;
+
+	rec->fmp.checksum = get_fmp_checksum(rec);
+
+	pr_debug("Writing to storage");
+
+	ret = erst_write(&rec->hdr);
+	if (ret)
+		pr_warn("Storage update failed for FRU 0x%016llx", rec->fmp.fru_id);
+
+	return ret;
+}
+
+static void init_fpd(struct cper_fru_poison_desc *fpd,  struct mce *m)
+{
+	memset(fpd, 0, sizeof(struct cper_fru_poison_desc));
+
+	fpd->timestamp	= m->time;
+	fpd->hw_id_type = FPD_HW_ID_TYPE_MCA_IPID;
+	fpd->hw_id	= m->ipid;
+	fpd->addr_type	= FPD_ADDR_TYPE_MCA_ADDR;
+	fpd->addr	= m->addr;
+}
+
+static bool has_valid_entries(u64 valid_bits)
+{
+	if (!(valid_bits & FMP_VALID_LIST_ENTRIES))
+		return false;
+
+	if (!(valid_bits & FMP_VALID_LIST))
+		return false;
+
+	return true;
+}
+
+static bool same_fpd(struct cper_fru_poison_desc *old, struct cper_fru_poison_desc *new)
+{
+	/*
+	 * Ignore timestamp field.
+	 * The same physical error may be reported multiple times due to stuck bits, etc.
+	 *
+	 * Also, order the checks from most->least likely to fail to shortcut the code.
+	 */
+	if (old->addr != new->addr)
+		return false;
+
+	if (old->hw_id != new->hw_id)
+		return false;
+
+	if (old->addr_type != new->addr_type)
+		return false;
+
+	if (old->hw_id_type != new->hw_id_type)
+		return false;
+
+	return true;
+}
+
+static bool is_dup_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *new)
+{
+	unsigned int i;
+
+	for (i = 0; i < get_fmp(rec)->nr_entries; i++) {
+		if (same_fpd(get_fpd(rec, i), new)) {
+			pr_debug("Found duplicate record");
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static void update_fru_record(struct fru_rec *rec, struct mce *m)
+{
+	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
+	struct cper_fru_poison_desc fpd;
+	u32 entry = 0;
+
+	mutex_lock(&fmpm_update_mutex);
+
+	init_fpd(&fpd, m);
+
+	/* This is the first entry, so just save it. */
+	if (!has_valid_entries(fmp->validation_bits))
+		goto save_fpd;
+
+	/* Ignore already recorded errors. */
+	if (is_dup_fpd(rec, &fpd))
+		goto out_unlock;
+
+	if (fmp->nr_entries >= max_nr_entries) {
+		pr_warn("Exceeded number of entries for FRU 0x%016llx", fmp->fru_id);
+		goto out_unlock;
+	}
+
+	entry = fmp->nr_entries;
+
+save_fpd:
+	memcpy(get_fpd(rec, entry), &fpd, sizeof(struct cper_fru_poison_desc));
+
+	fmp->nr_entries		 = entry + 1;
+	fmp->validation_bits	|= FMP_VALID_LIST_ENTRIES;
+	fmp->validation_bits	|= FMP_VALID_LIST;
+
+	pr_debug("Updated FRU 0x%016llx Entry #%u", fmp->fru_id, entry);
+
+	update_record_on_storage(rec);
+
+out_unlock:
+	mutex_unlock(&fmpm_update_mutex);
+}
+
+static void retire_dram_row(u64 addr, u64 id, u32 cpu)
+{
+	struct atl_err a_err;
+
+	memset(&a_err, 0, sizeof(struct atl_err));
+
+	a_err.addr = addr;
+	a_err.ipid = id;
+	a_err.cpu  = cpu;
+
+	amd_retire_dram_row(&a_err);
+}
+
+static int fru_mem_poison_handler(struct notifier_block *nb, unsigned long val, void *data)
+{
+	struct mce *m = (struct mce *)data;
+	struct fru_rec *rec;
+
+	if (!mce_is_memory_error(m))
+		return NOTIFY_DONE;
+
+	retire_dram_row(m->addr, m->ipid, m->extcpu);
+
+	/*
+	 * This should not happen on real errors. But it could happen from
+	 * software error injection, etc.
+	 */
+	rec = get_fru_record(m->ppin);
+	if (!rec)
+		return NOTIFY_DONE;
+
+	update_fru_record(rec, m);
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block fru_mem_poison_nb = {
+	.notifier_call  = fru_mem_poison_handler,
+	.priority	= MCE_PRIO_LOWEST,
+};
+
+static u32 get_cpu_from_fru_id(u64 fru_id)
+{
+	unsigned int cpu = 0;
+
+	/* Should there be more robust error handling if none found? */
+	for_each_online_cpu(cpu) {
+		if (topology_ppin(cpu) == fru_id)
+			break;
+	}
+
+	return cpu;
+}
+
+static void retire_mem_fmp(struct fru_rec *rec, u32 nr_entries, u32 cpu)
+{
+	struct cper_fru_poison_desc *fpd;
+	unsigned int i;
+
+	for (i = 0; i < nr_entries; i++) {
+		fpd = get_fpd(rec, i);
+
+		if (fpd->hw_id_type != FPD_HW_ID_TYPE_MCA_IPID)
+			continue;
+
+		if (fpd->addr_type != FPD_ADDR_TYPE_MCA_ADDR)
+			continue;
+
+		retire_dram_row(fpd->addr, fpd->hw_id, cpu);
+	}
+}
+
+static void retire_mem_records(void)
+{
+	struct cper_sec_fru_mem_poison *fmp;
+	struct fru_rec *rec;
+	unsigned int i;
+	u32 cpu;
+
+	for_each_fru(i, rec) {
+		fmp = get_fmp(rec);
+
+		if (!has_valid_entries(fmp->validation_bits))
+			continue;
+
+		cpu = get_cpu_from_fru_id(fmp->fru_id);
+
+		retire_mem_fmp(rec, fmp->nr_entries, cpu);
+	}
+}
+
+/* Set the CPER Record Header and CPER Section Descriptor fields. */
+static void set_rec_fields(struct fru_rec *rec)
+{
+	struct cper_section_descriptor	*sec_desc = &rec->sec_desc;
+	struct cper_record_header	*hdr	  = &rec->hdr;
+
+	memcpy(hdr->signature, CPER_SIG_RECORD, CPER_SIG_SIZE);
+	hdr->revision			= CPER_RECORD_REV;
+	hdr->signature_end		= CPER_SIG_END;
+
+	/*
+	 * Currently, it is assumed that there is one FRU Memory Poison
+	 * section per CPER. But this may change for other implementations.
+	 */
+	hdr->section_count		= 1;
+
+	/* The logged errors are recoverable. Otherwise, they'd never make it here. */
+	hdr->error_severity		= CPER_SEV_RECOVERABLE;
+
+	hdr->validation_bits		= 0;
+	hdr->record_length		= max_rec_len;
+	hdr->creator_id			= CPER_CREATOR_FMP;
+	hdr->notification_type		= CPER_NOTIFY_MCE;
+	hdr->record_id			= cper_next_record_id();
+	hdr->flags			= CPER_HW_ERROR_FLAGS_PREVERR;
+
+	sec_desc->section_offset	= sizeof(struct cper_record_header);
+	sec_desc->section_length	= max_rec_len - sizeof(struct cper_record_header);
+	sec_desc->revision		= CPER_SEC_REV;
+	sec_desc->validation_bits	= 0;
+	sec_desc->flags			= CPER_SEC_PRIMARY;
+	sec_desc->section_type		= CPER_SECTION_TYPE_FMP;
+	sec_desc->section_severity	= CPER_SEV_RECOVERABLE;
+}
+
+static int save_new_records(void)
+{
+	struct fru_rec *rec;
+	unsigned int i;
+	int ret = 0;
+
+	for_each_fru(i, rec) {
+		/* Skip restored records. Should these be fixed up? */
+		if (rec->hdr.record_length)
+			continue;
+
+		set_rec_fields(rec);
+
+		ret = update_record_on_storage(rec);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static bool is_valid_fmp(struct fru_rec *rec)
+{
+	struct cper_sec_fru_mem_poison *fmp = get_fmp(rec);
+	u32 len = get_fmp_len(rec);
+
+	if (!fmp)
+		return false;
+
+	if (!len)
+		return false;
+
+	/* Checksum must sum to zero for the entire section. */
+	if (do_fmp_checksum(fmp, len))
+		return false;
+
+	if (!(fmp->validation_bits & FMP_VALID_ARCH_TYPE))
+		return false;
+
+	if (fmp->fru_arch_type != FMP_ARCH_TYPE_X86_CPUID_1_EAX)
+		return false;
+
+	if (!(fmp->validation_bits & FMP_VALID_ARCH))
+		return false;
+
+	if (fmp->fru_arch != cpuid_eax(1))
+		return false;
+
+	if (!(fmp->validation_bits & FMP_VALID_ID_TYPE))
+		return false;
+
+	if (fmp->fru_id_type != FMP_ID_TYPE_X86_PPIN)
+		return false;
+
+	if (!(fmp->validation_bits & FMP_VALID_ID))
+		return false;
+
+	return true;
+}
+
+static void restore_record(struct fru_rec *new, struct fru_rec *old)
+{
+	/* Records larger than max_rec_len were skipped earlier. */
+	size_t len = min(max_rec_len, old->hdr.record_length);
+
+	memcpy(new, old, len);
+}
+
+static bool valid_record(struct fru_rec *old)
+{
+	struct fru_rec *new;
+
+	if (!is_valid_fmp(old)) {
+		pr_debug("Ignoring invalid record");
+		return false;
+	}
+
+	new = get_fru_record(old->fmp.fru_id);
+	if (!new) {
+		pr_debug("Ignoring record for absent FRU");
+		return false;
+	}
+
+	/* What if ERST has duplicate FRU entries? */
+	restore_record(new, old);
+
+	return true;
+}
+
+/*
+ * Fetch saved records from persistent storage.
+ *
+ * For each found record:
+ * - If it was not created by this module, then ignore it.
+ * - If it is valid, then copy its data to the local cache.
+ * - If it is not valid, then erase it.
+ */
+static int get_saved_records(void)
+{
+	struct fru_rec *old;
+	u64 record_id;
+	int ret, pos;
+	ssize_t len;
+
+	/*
+	 * Assume saved records match current max size.
+	 *
+	 * However, this may not be true depending on module parameters.
+	 */
+	old = kmalloc(max_rec_len, GFP_KERNEL);
+	if (!old) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = erst_get_record_id_begin(&pos);
+	if (ret < 0)
+		goto out_end;
+
+	while (!erst_get_record_id_next(&pos, &record_id)) {
+		/*
+		 * Make sure to clear temporary buffer between reads to avoid
+		 * leftover data from records of various sizes.
+		 */
+		memset(old, 0, max_rec_len);
+
+		len = erst_read_record(record_id, &old->hdr, max_rec_len,
+				       sizeof(struct fru_rec), &CPER_CREATOR_FMP);
+
+		/* Should this be retried if the temporary buffer is too small? */
+		if (len < 0)
+			continue;
+
+		if (!valid_record(old))
+			erst_clear(record_id);
+	}
+
+out_end:
+	erst_get_record_id_end();
+	kfree(old);
+out:
+	return ret;
+}
+
+static void set_fmp_fields(struct cper_sec_fru_mem_poison *fmp, unsigned int cpu)
+{
+	fmp->fru_arch_type    = FMP_ARCH_TYPE_X86_CPUID_1_EAX;
+	fmp->validation_bits |= FMP_VALID_ARCH_TYPE;
+
+	/* Assume all CPUs in the system have the same value for now. */
+	fmp->fru_arch	      = cpuid_eax(1);
+	fmp->validation_bits |= FMP_VALID_ARCH;
+
+	fmp->fru_id_type      = FMP_ID_TYPE_X86_PPIN;
+	fmp->validation_bits |= FMP_VALID_ID_TYPE;
+
+	fmp->fru_id	      = topology_ppin(cpu);
+	fmp->validation_bits |= FMP_VALID_ID;
+}
+
+static unsigned int get_cpu_for_fru_num(unsigned int i)
+{
+	unsigned int cpu = 0;
+
+	/* Should there be more robust error handling if none found? */
+	for_each_online_cpu(cpu) {
+		if (topology_physical_package_id(cpu) == i)
+			return cpu;
+	}
+
+	return cpu;
+}
+
+static void init_fmps(void)
+{
+	struct fru_rec *rec;
+	unsigned int i, cpu;
+
+	for_each_fru(i, rec) {
+		cpu = get_cpu_for_fru_num(i);
+		set_fmp_fields(get_fmp(rec), cpu);
+	}
+}
+
+static int get_system_info(void)
+{
+	u8 model = boot_cpu_data.x86_model;
+
+	/* Only load on MI300A systems for now. */
+	if (!(model >= 0x90 && model <= 0x9f))
+		return -ENODEV;
+
+	if (!cpu_feature_enabled(X86_FEATURE_AMD_PPIN)) {
+		pr_debug("PPIN feature not available");
+		return -ENODEV;
+	}
+
+	/* Use CPU Package (Socket) as FRU for MI300 systems. */
+	max_nr_fru = topology_max_packages();
+	if (!max_nr_fru)
+		return -ENODEV;
+
+	if (!max_nr_entries)
+		max_nr_entries = FMPM_DEFAULT_MAX_NR_ENTRIES;
+
+	max_rec_len  = sizeof(struct fru_rec);
+	max_rec_len += sizeof(struct cper_fru_poison_desc) * max_nr_entries;
+
+	pr_debug("max_nr_fru=%u max_nr_entries=%u, max_rec_len=%lu",
+		 max_nr_fru, max_nr_entries, max_rec_len);
+	return 0;
+}
+
+static void deallocate_records(void)
+{
+	struct fru_rec *rec;
+	int i;
+
+	for_each_fru(i, rec)
+		kfree(rec);
+
+	kfree(fru_records);
+}
+
+static int allocate_records(void)
+{
+	int i, ret = 0;
+
+	fru_records = kcalloc(max_nr_fru, sizeof(struct fru_rec *), GFP_KERNEL);
+	if (!fru_records) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < max_nr_fru; i++) {
+		fru_records[i] = kzalloc(max_rec_len, GFP_KERNEL);
+		if (!fru_records[i]) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+	}
+
+	return ret;
+
+out_free:
+	for (; i >= 0; i--)
+		kfree(fru_records[i]);
+
+	kfree(fru_records);
+out:
+	return ret;
+}
+
+static const struct x86_cpu_id fmpm_cpuids[] = {
+	X86_MATCH_VENDOR_FAM(AMD, 0x19, NULL),
+	{ }
+};
+MODULE_DEVICE_TABLE(x86cpu, fmpm_cpuids);
+
+static int __init fru_mem_poison_init(void)
+{
+	int ret;
+
+	if (!x86_match_cpu(fmpm_cpuids)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	if (erst_disable) {
+		pr_debug("ERST not available");
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ret = get_system_info();
+	if (ret)
+		goto out;
+
+	ret = allocate_records();
+	if (ret)
+		goto out;
+
+	init_fmps();
+
+	ret = get_saved_records();
+	if (ret)
+		goto out_free;
+
+	ret = save_new_records();
+	if (ret)
+		goto out_free;
+
+	retire_mem_records();
+
+	mce_register_decode_chain(&fru_mem_poison_nb);
+
+	pr_info("FRU Memory Poison Manager initialized");
+	return 0;
+
+out_free:
+	deallocate_records();
+out:
+	return ret;
+}
+
+static void __exit fru_mem_poison_exit(void)
+{
+	mce_unregister_decode_chain(&fru_mem_poison_nb);
+	deallocate_records();
+}
+
+module_init(fru_mem_poison_init);
+module_exit(fru_mem_poison_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("FRU Memory Poison Manager");