[0/2] FRU Memory Poison Manager

Message ID 20240214033516.1344948-1-yazen.ghannam@amd.com

Message

Yazen Ghannam Feb. 14, 2024, 3:35 a.m. UTC
  Hi all,

This set adds a new module to manage error records on persistent
storage.

Patch 1 moves a function from AMD64 EDAC to the AMD Address Translation
Library. This is needed for patch 2.

Patch 2 adds the new module. This is a near total rewrite based on patch
2 from the following set:
https://lore.kernel.org/r/20231129075034.2159223-1-muralimk@amd.com

I included questions in code comments where I think more attention is
needed.

I'd like to add Murali and Naveen as Co-developers, since this is based
on their work. Also, I kept Naveen as a maintainer in case he's still
interested.

Regarding the old set:
 * Patch 1 exports a new function from the ERST driver. This is not
   necessary.

 * Patch 3 adds a new sysfs interface. This needs more work.

 * Patch 4 adds documentation. This needs updating.

I did some basic testing on a 2P server system without ERST support.
Mostly I tried to check out the memory layout of the structures. And I
did some memory error injections to check out the record updating flow.
I did some fixups after testing, so I apologize if I missed anything.

Thanks,
Yazen

Yazen Ghannam (2):
  RAS/AMD/ATL, EDAC/amd64: Move MI300 Row Retirement to ATL
  RAS: Introduce the FRU Memory Poison Manager

 MAINTAINERS                 |   7 +
 drivers/edac/Kconfig        |   1 -
 drivers/edac/amd64_edac.c   |  48 ---
 drivers/ras/Kconfig         |  13 +
 drivers/ras/Makefile        |   1 +
 drivers/ras/amd/atl/Kconfig |   1 +
 drivers/ras/amd/atl/umc.c   |  51 +++
 drivers/ras/amd/fmpm.c      | 776 ++++++++++++++++++++++++++++++++++++
 include/linux/ras.h         |   2 +
 9 files changed, 851 insertions(+), 49 deletions(-)
 create mode 100644 drivers/ras/amd/fmpm.c


base-commit: c2064388aa8765abd7c2c5785e7bfe266a2f6cd3
  

Comments

Borislav Petkov Feb. 14, 2024, 7:52 a.m. UTC | #1
On Tue, Feb 13, 2024 at 09:35:14PM -0600, Yazen Ghannam wrote:
> I included questions in code comments where I think more attention is
> needed.

Lemme look.

> Also, I kept Naveen as a maintainer in case he's still interested.

I don't mind that as long as he responds to bug reports from users and
addresses them in a timely manner.

> I did some basic testing on a 2P server system without ERST support.
> Mostly I tried to check out the memory layout of the structures. And I
> did some memory error injections to check out the record updating flow.
> I did some fixups after testing, so I apologize if I missed anything.

Right, I'd like for Murali and/or Naveen to test the final version but
lemme go through them first.

Thx.
  
M K, Muralidhara Feb. 20, 2024, 12:29 p.m. UTC | #2
Hi Boris,

On 2/14/2024 1:22 PM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:14PM -0600, Yazen Ghannam wrote:
>> I included questions in code comments where I think more attention is
>> needed.
> 
> Lemme look.
> 
>> Also, I kept Naveen as a maintainer in case he's still interested.
> 
> I don't mind that as long as he responds to bug reports from users and
> addresses them in a timely manner.
> 
>> I did some basic testing on a 2P server system without ERST support.
>> Mostly I tried to check out the memory layout of the structures. And I
>> did some memory error injections to check out the record updating flow.
>> I did some fixups after testing, so I apologize if I missed anything.
> 
> Right, I'd like for Murali and/or Naveen to test the final version but
> lemme go through them first.
> 
Please include the tags below; we worked on previous versions of this patch set.
Co-developed-by: naveenkrishna.chatradhi@amd.com
Signed-off-by: naveenkrishna.chatradhi@amd.com
Co-developed-by: muralidhara.mk@amd.com
Signed-off-by: muralidhara.mk@amd.com
Co-developed-by: sathyapriya.k@amd.com
Signed-off-by: sathyapriya.k@amd.com


> Thx.
> 
> --
> Regards/Gruss,
>      Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
  
M K, Muralidhara Feb. 20, 2024, 5:10 p.m. UTC | #3
On 2/20/2024 5:59 PM, M K, Muralidhara wrote:
> Hi Boris,
> 
> On 2/14/2024 1:22 PM, Borislav Petkov wrote:
>> On Tue, Feb 13, 2024 at 09:35:14PM -0600, Yazen Ghannam wrote:
>>> I included questions in code comments where I think more attention is
>>> needed.
>>
>> Lemme look.
>>
>>> Also, I kept Naveen as a maintainer in case he's still interested.
>>
>> I don't mind that as long as he responds to bug reports from users and
>> addresses them in a timely manner.
>>
>>> I did some basic testing on a 2P server system without ERST support.
>>> Mostly I tried to check out the memory layout of the structures. And I
>>> did some memory error injections to check out the record updating flow.
>>> I did some fixups after testing, so I apologize if I missed anything.
>>
>> Right, I'd like for Murali and/or Naveen to test the final version but
>> lemme go through them first.
>>
> Please include the tags below; we worked on previous versions of this patch set.
> Co-developed-by: naveenkrishna.chatradhi@amd.com
> Signed-off-by: naveenkrishna.chatradhi@amd.com
> Co-developed-by: muralidhara.mk@amd.com
> Signed-off-by: muralidhara.mk@amd.com
> Co-developed-by: sathyapriya.k@amd.com
> Signed-off-by: sathyapriya.k@amd.com
> 
>
Sorry, just rearranging the tags. Please add the tags below:

Co-developed-by: naveenkrishna.chatradhi@amd.com
Signed-off-by: naveenkrishna.chatradhi@amd.com
Co-developed-by: muralidhara.mk@amd.com
Signed-off-by: muralidhara.mk@amd.com
Tested-by: sathyapriya.k@amd.com


>> Thx.
>>
>> -- 
>> Regards/Gruss,
>>      Boris.
>>
>> https://people.kernel.org/tglx/notes-about-netiquette
>>