[v1,3/3] x86/MCE/AMD: Handle reassigned bit definitions for CS SMCA
Commit Message
Currently, on AMD systems with Scalable MCA (SMCA), each machine check
error of a SMCA bank type has an associated bit position in the bank's
control (CTL) register used for enabling / disabling reporting of the
very error. An error's bit position in the CTL register is also used
during error decoding for offsetting into the corresponding bank's error
description structure. As new errors are being added in newer AMD systems
for existing SMCA bank types, the underlying SMCA architecture guarantees
that the bit positions of existing errors are not altered.
However, on some AMD systems viz. Genoa, some of the existing bit
definitions in the CTL register of the Coherent Slave (CS) SMCA bank type
are reassigned without defining new HWID and McaType. Consequently, the
very errors whose bit definitions have been reassigned in the CTL register
are being erroneously decoded.
As a solution, create a new software defined SMCA bank type by utilizing
one of the hardware-reserved values for HWID. The new SMCA bank type will
only be employed for CS error decoding on affected CPU models.
Additionally, since the existing error description structure for the CS
SMCA bank type is still valid, add new error description structure to
compensate for the reassigned bit definitions.
Signed-off-by: Avadhut Naik <avadnaik@amd.com>
---
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mce/amd.c | 24 +++++++++++++++++++++++-
drivers/edac/mce_amd.c | 26 ++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 1 deletion(-)
Comments
* Avadhut Naik <avadnaik@amd.com> wrote:
> @@ -178,6 +178,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
> { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
> { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
> { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
> + /* Software defined SMCA bank type to handle erratum 1384*/
> + { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) },
>
> /* Unified Memory Controller MCA type */
> { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
> @@ -259,6 +261,17 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
>
> if (c->x86 == 0x19) {
> switch (c->x86_model) {
> + /*
> + * Per Genoa's revision guide, erratum 1384, some SMCA Extended
> + * Error Codes and SMCA Control bits are incorrect for SMCA CS
> + * bank type.
> + */
> + case 0x10 ... 0x1F:
> + case 0x60 ... 0x7B:
> + case 0xA0 ... 0xAF:
> + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
> + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
Why are we open-coding these types?
Why not use smca_hwid_mcatypes[SMCA_CS_V2], etc.?
> + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
> + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
Ditto.
Thanks,
Ingo
On Tue, Jan 17, 2023 at 10:23:09AM +0100, Ingo Molnar wrote:
>
> * Avadhut Naik <avadnaik@amd.com> wrote:
>
> > @@ -178,6 +178,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
> > { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
> > { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
> > { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
> > + /* Software defined SMCA bank type to handle erratum 1384*/
> > + { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) },
> >
> > /* Unified Memory Controller MCA type */
> > { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
> > @@ -259,6 +261,17 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
> >
> > if (c->x86 == 0x19) {
> > switch (c->x86_model) {
> > + /*
> > + * Per Genoa's revision guide, erratum 1384, some SMCA Extended
> > + * Error Codes and SMCA Control bits are incorrect for SMCA CS
> > + * bank type.
> > + */
> > + case 0x10 ... 0x1F:
> > + case 0x60 ... 0x7B:
> > + case 0xA0 ... 0xAF:
> > + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
> > + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
>
> Why are we open-coding these types?
>
> Why not use smca_hwid_mcatypes[SMCA_CS_V2], etc.?
>
> > + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
> > + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
>
> Ditto.
>
This code runs before matching on a struct with the enums. It seems simplest
to fixup the hardware-provided value before setting things up rather than
changing things later.
Current code flow:
1) Read HWID/McaType values from hardware.
2) Loop through known types and match on the HWID_MCATYPE() tuple.
3) Continue setup based on matched type and its Linux enum.
New code flow:
1) Read HWID/McaType values from hardware.
--> Fixup HWID/McaType values due to any known hardware quirks.
2) Loop through known types and match on the HWID_MCATYPE() tuple.
3) Continue setup based on matched type and its Linux enum.
What do you think?
Also, a further update (maybe you're alluding to?) is get rid of the struct
smca_hwid and just define an enum with "NAME = HWID_MCATYPE(XXX, YYY)".
The struct smca_hwid had another field that was removed, so it seems
unnecessary at the moment.
Thanks,
Yazen
Hi,
On 1/17/2023 03:23, Ingo Molnar wrote:
>
> * Avadhut Naik <avadnaik@amd.com> wrote:
>
>> @@ -178,6 +178,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
>> { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
>> { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
>> { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
>> + /* Software defined SMCA bank type to handle erratum 1384*/
>> + { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) },
>>
>> /* Unified Memory Controller MCA type */
>> { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
>> @@ -259,6 +261,17 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
>>
>> if (c->x86 == 0x19) {
>> switch (c->x86_model) {
>> + /*
>> + * Per Genoa's revision guide, erratum 1384, some SMCA Extended
>> + * Error Codes and SMCA Control bits are incorrect for SMCA CS
>> + * bank type.
>> + */
>> + case 0x10 ... 0x1F:
>> + case 0x60 ... 0x7B:
>> + case 0xA0 ... 0xAF:
>> + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
>> + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
>
> Why are we open-coding these types?
>
> Why not use smca_hwid_mcatypes[SMCA_CS_V2], etc.?
If I understood correctly (And please rectify me if I didn't), did you mean using something like the below snippet instead of what I have used?
if (*hwid_mcatype == smca_hwid_mcatypes[SMCA_CS_V2].hwid_mcatype)
*hwid_mcatype = smca_hwid_mcatypes[SMCA_CS_V2_QUIRK].hwid_mcatype;
If yes, then SMCA_CS_V2, SMCA_CS_V2_QUIRK etc. originate from the enum smca_bank_types in arch/x86/include/asm/mce.h.
As the enum stands now, it cannot be used for indexing into the smca_hwid_mcatypes array since it might result in incorrect indexing.
Please advise.
Thanks,
Avadhut Naik
>
>> + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
>> + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
>
> Ditto.
>
> Thanks,
>
> Ingo
On Tue, Jan 17, 2023 at 10:23:09AM +0100, Ingo Molnar wrote:
>
> * Avadhut Naik <avadnaik@amd.com> wrote:
>
> > @@ -178,6 +178,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
> > { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
> > { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
> > { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
> > + /* Software defined SMCA bank type to handle erratum 1384*/
> > + { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) },
> >
> > /* Unified Memory Controller MCA type */
> > { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
> > @@ -259,6 +261,17 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
> >
> > if (c->x86 == 0x19) {
> > switch (c->x86_model) {
> > + /*
> > + * Per Genoa's revision guide, erratum 1384, some SMCA Extended
> > + * Error Codes and SMCA Control bits are incorrect for SMCA CS
> > + * bank type.
> > + */
> > + case 0x10 ... 0x1F:
> > + case 0x60 ... 0x7B:
> > + case 0xA0 ... 0xAF:
> > + if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
> > + *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
>
> Why are we open-coding these types?
>
> Why not use smca_hwid_mcatypes[SMCA_CS_V2], etc.?
>
Hi Ingo,
Is this what you mean?
if (*hwid_mcatype == smca_hwid_mcatypes[SMCA_CS_V2].hwid_mcatype)
*hwid_mcatype = smca_hwid_mcatypes[SMCA_CS_V2_QUIRK].hwid_mcatype;
I think that's a good idea.
Avadhut,
Can you please make this change here and in the other patch?
Thanks,
Yazen
@@ -308,6 +308,7 @@ enum smca_bank_types {
SMCA_L3_CACHE, /* L3 Cache */
SMCA_CS, /* Coherent Slave */
SMCA_CS_V2,
+ SMCA_CS_V2_QUIRK,
SMCA_PIE, /* Power, Interrupts, etc. */
SMCA_UMC, /* Unified Memory Controller */
SMCA_UMC_V2,
@@ -101,7 +101,7 @@ static struct smca_bank_name smca_names[] = {
[SMCA_EX] = { "execution_unit", "Execution Unit" },
[SMCA_FP] = { "floating_point", "Floating Point Unit" },
[SMCA_L3_CACHE] = { "l3_cache", "L3 Cache" },
- [SMCA_CS ... SMCA_CS_V2] = { "coherent_slave", "Coherent Slave" },
+ [SMCA_CS ... SMCA_CS_V2_QUIRK] = { "coherent_slave", "Coherent Slave" },
[SMCA_PIE] = { "pie", "Power, Interrupts, etc." },
/* UMC v2 is separate because both of them can exist in a single system. */
@@ -178,6 +178,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
{ SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
{ SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
{ SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
+ /* Software defined SMCA bank type to handle erratum 1384*/
+ { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) },
/* Unified Memory Controller MCA type */
{ SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
@@ -259,6 +261,17 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
if (c->x86 == 0x19) {
switch (c->x86_model) {
+ /*
+ * Per Genoa's revision guide, erratum 1384, some SMCA Extended
+ * Error Codes and SMCA Control bits are incorrect for SMCA CS
+ * bank type.
+ */
+ case 0x10 ... 0x1F:
+ case 0x60 ... 0x7B:
+ case 0xA0 ... 0xAF:
+ if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
+ *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
+ break;
/*
* Handle discrepancy in HWID of kernel and MCA_IPID register
* for XGMI Controller SMCA bank type
@@ -270,6 +283,15 @@ static inline void fixup_hwid(unsigned int *hwid_mcatype)
default:
break;
}
+ } else if (c->x86 == 0x1A) {
+ switch (c->x86_model) {
+ case 0x40 ... 0x4F:
+ if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
+ *hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
+ break;
+ default:
+ break;
+ }
}
}
@@ -311,6 +311,31 @@ static const char * const smca_cs2_mce_desc[] = {
"Hardware Assert Error",
};
+/*
+ * Per Genoa's revision guide, erratum 1384, existing bit definitions
+ * are reassigned for SMCA CS bank type.
+ */
+static const char * const smca_cs2_quirk_mce_desc[] = {
+ "Illegal Request",
+ "Address Violation",
+ "Security Violation",
+ "Illegal Response",
+ "Unexpected Response",
+ "Request or Probe Parity Error",
+ "Read Response Parity Error",
+ "Atomic Request Parity Error",
+ "SDP read response had no match in the CS queue",
+ "SDP read response had an unexpected RETRY error",
+ "Counter overflow error",
+ "Counter underflow error",
+ "Probe Filter Protocol Error",
+ "Probe Filter ECC Error",
+ "Illegal Request on the no data channel",
+ "Address Violation on the no data channel",
+ "Security Violation on the no data channel",
+ "Hardware Assert Error",
+};
+
static const char * const smca_pie_mce_desc[] = {
"Hardware Assert",
"Register security violation",
@@ -602,6 +627,7 @@ static struct smca_mce_desc smca_mce_descs[] = {
[SMCA_L3_CACHE] = { smca_l3_mce_desc, ARRAY_SIZE(smca_l3_mce_desc) },
[SMCA_CS] = { smca_cs_mce_desc, ARRAY_SIZE(smca_cs_mce_desc) },
[SMCA_CS_V2] = { smca_cs2_mce_desc, ARRAY_SIZE(smca_cs2_mce_desc) },
+ [SMCA_CS_V2_QUIRK] = { smca_cs2_quirk_mce_desc, ARRAY_SIZE(smca_cs2_quirk_mce_desc)},
[SMCA_PIE] = { smca_pie_mce_desc, ARRAY_SIZE(smca_pie_mce_desc) },
[SMCA_UMC] = { smca_umc_mce_desc, ARRAY_SIZE(smca_umc_mce_desc) },
[SMCA_UMC_V2] = { smca_umc2_mce_desc, ARRAY_SIZE(smca_umc2_mce_desc) },