Documentation: Begin a RAS section

Message ID 20231128142049.GTZWX3QQTSaQk/+u53@fat_crate.local
State New
Series Documentation: Begin a RAS section

Commit Message

Borislav Petkov Nov. 28, 2023, 2:20 p.m. UTC
  On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> On AMD systems with Scalable MCA, each machine check error of an SMCA bank
> type has an associated bit position in the bank's control (CTL) register.

On top of this. It is long overdue:

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Tue, 28 Nov 2023 14:37:56 +0100

Add some initial RAS documentation. The expectation is for this to
collect all the user-visible features for interacting with the RAS
features of the kernel.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
 Documentation/index.rst   |  1 +
 2 files changed, 27 insertions(+)
 create mode 100644 Documentation/RAS/ras.rst
  

Comments

Jonathan Corbet Jan. 9, 2024, 5:47 p.m. UTC | #1
Borislav Petkov <bp@alien8.de> writes:

> On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
>> From: Muralidhara M K <muralidhara.mk@amd.com>
>> 
>> On AMD systems with Scalable MCA, each machine check error of an SMCA bank
>> type has an associated bit position in the bank's control (CTL) register.
>
> On top of this. It is long overdue:
>
> ---
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 28 Nov 2023 14:37:56 +0100
>
> Add some initial RAS documentation. The expectation is for this to
> collect all the user-visible features for interacting with the RAS
> features of the kernel.
>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
>  Documentation/index.rst   |  1 +
>  2 files changed, 27 insertions(+)
>  create mode 100644 Documentation/RAS/ras.rst

I wish I'd been copied on this ... I've been working to get a handle on
the top-level Documentation/ directories for a while, and would rather
not see a new one added for this.  Offhand, based on this first
document, it looks like material that belongs under
Documentation/admin-guide; can we move it there, please?

Thanks,

jon
  
Borislav Petkov Jan. 9, 2024, 6:36 p.m. UTC | #2
On Tue, Jan 09, 2024 at 10:47:29AM -0700, Jonathan Corbet wrote:
> I wish I'd been copied on this ... 

linux-doc was CCed:

https://lore.kernel.org/all/20231128142049.GTZWX3QQTSaQk%2F+u53@fat_crate.local/

Or would you have preferred to be CCed directly?

> I've been working to get a handle on
> the top-level Documentation/ directories for a while, and would rather
> not see a new one added for this.  Offhand, based on this first
> document, it looks like material that belongs under
> Documentation/admin-guide; can we move it there, please?

Not really an admin-guide thing. Based on the current content, yes, but
the aim is to document all things RAS, so it is more of a subsystem
thing. And all the subsystems are directories under
Documentation/.

So where do you want me to put it?

Thx.
  
Jonathan Corbet Jan. 9, 2024, 7:44 p.m. UTC | #3
Borislav Petkov <bp@alien8.de> writes:

> On Tue, Jan 09, 2024 at 10:47:29AM -0700, Jonathan Corbet wrote:
>> I wish I'd been copied on this ... 
>
> linux-doc was CCed:
>
> https://lore.kernel.org/all/20231128142049.GTZWX3QQTSaQk%2F+u53@fat_crate.local/
>
> Or would you have preferred to be CCed directly?

Lots of stuff goes to linux-doc, I can miss things.

Of course, I miss things in my own email too...you know the drill...

>> I've been working to get a handle on
>> the top-level Documentation/ directories for a while, and would rather
>> not see a new one added for this.  Offhand, based on this first
>> document, it looks like material that belongs under
>> Documentation/admin-guide; can we move it there, please?
>
> Not really an admin-guide thing. Based on the current content, yes, but
> the aim is to document all things RAS, so it is more of a subsystem
> thing. And all the subsystems are directories under
> Documentation/.
>
> So where do you want me to put it?

The hope with all of this documentation thrashing has been to organize
our docs with the *reader* in mind.  "All things RAS" is convenient for
RAS developers, but not for (say) a sysadmin trying to figure out how to
make use of it.  So I would really rather see RAS documentation placed
under admin-guide or userspace-api as appropriate.

Yes, there is a lot of existing documentation that still doesn't live up
to this idea, but we can try to follow it for new stuff while the rest
is (slowly) fixed up.

Make sense?

Thanks,

jon
  
Borislav Petkov Jan. 9, 2024, 8:04 p.m. UTC | #4
On Tue, Jan 09, 2024 at 12:44:41PM -0700, Jonathan Corbet wrote:
> Of course, I miss things in my own email too...you know the drill...

Yeah, tell me about it.

My train of thought with CCing maintainers in such cases usually is: I'd
CC the mailing list as I don't want to bother the maintainer - she/he gets
too much email already and this is an FYI thing anyway, so she/he'll find
it in the archives eventually.

> Yes, there is a lot of existing documentation that still doesn't live up
> to this idea, but we can try to follow it for new stuff while the rest
> is (slowly) fixed up.

The problem I see here is that not all of the RAS stuff will be
"admin-guide" stuff; some of it will document design decisions we've made.
I mean, for a really curious admin it'll be right up her/his alley, but it
won't be purely descriptions of administrative tasks.

At the end of the day, I don't really care where it is as long as it is
in one place and we can point people to it and say, here, that's why we
did it the way we did it and what you can do about it.

So I'm fine with admin-guide too - just pointing out a potential issue
I see.

Thx.
  
Borislav Petkov Jan. 24, 2024, 12:40 p.m. UTC | #5
On Tue, Jan 09, 2024 at 09:04:34PM +0100, Borislav Petkov wrote:
> So I'm fine with admin-guide too - just pointing out a potential issue
> I see.

Ok, how does that look?

I've merged it with ras.rst which we had there already and with some
more new documentation that is coming from:

https://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git/log/?h=edac-amd-atl

Thx.

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Wed, 24 Jan 2024 13:37:52 +0100
Subject: [PATCH] Documentation: Move RAS section to admin-guide

This is where this stuff should be.

Requested-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 Documentation/RAS/index.rst                        | 14 --------------
 .../{ => admin-guide}/RAS/address-translation.rst  |  0
 .../{ => admin-guide}/RAS/error-decoding.rst       |  0
 Documentation/admin-guide/RAS/index.rst            |  7 +++++++
 .../admin-guide/{ras.rst => RAS/main.rst}          | 10 +++++++---
 Documentation/admin-guide/index.rst                |  2 +-
 Documentation/index.rst                            |  1 -
 7 files changed, 15 insertions(+), 19 deletions(-)
 delete mode 100644 Documentation/RAS/index.rst
 rename Documentation/{ => admin-guide}/RAS/address-translation.rst (100%)
 rename Documentation/{ => admin-guide}/RAS/error-decoding.rst (100%)
 create mode 100644 Documentation/admin-guide/RAS/index.rst
 rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)

diff --git a/Documentation/RAS/index.rst b/Documentation/RAS/index.rst
deleted file mode 100644
index 2794c1816e90..000000000000
--- a/Documentation/RAS/index.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================================================
-Reliability, Availability and Serviceability (RAS) features
-===========================================================
-
-This documents different aspects of the RAS functionality present in the
-kernel.
-
-.. toctree::
-   :maxdepth: 2
-
-   error-decoding
-   address-translation
diff --git a/Documentation/RAS/address-translation.rst b/Documentation/admin-guide/RAS/address-translation.rst
similarity index 100%
rename from Documentation/RAS/address-translation.rst
rename to Documentation/admin-guide/RAS/address-translation.rst
diff --git a/Documentation/RAS/error-decoding.rst b/Documentation/admin-guide/RAS/error-decoding.rst
similarity index 100%
rename from Documentation/RAS/error-decoding.rst
rename to Documentation/admin-guide/RAS/error-decoding.rst
diff --git a/Documentation/admin-guide/RAS/index.rst b/Documentation/admin-guide/RAS/index.rst
new file mode 100644
index 000000000000..f4087040a7c0
--- /dev/null
+++ b/Documentation/admin-guide/RAS/index.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. toctree::
+   :maxdepth: 2
+
+   main
+   error-decoding
+   address-translation
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/RAS/main.rst
similarity index 99%
rename from Documentation/admin-guide/ras.rst
rename to Documentation/admin-guide/RAS/main.rst
index 8e03751d126d..7ac1d4ccc509 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -1,8 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
 .. include:: <isonum.txt>
 
-============================================
-Reliability, Availability and Serviceability
-============================================
+==================================================
+Reliability, Availability and Serviceability (RAS)
+==================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
 
 RAS concepts
 ************
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index fb40a1f6f79e..dfc06fab9432 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -122,7 +122,7 @@ configure specific aspects of kernel behavior to your liking.
    pmf
    pnp
    rapidio
-   ras
+   RAS/index
    rtc
    serial-console
    svga
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 07f2aa07f0fa..9dfdc826618c 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -113,7 +113,6 @@ to ReStructured Text format, or are simply too old.
    :maxdepth: 1
 
    staging/index
-   RAS/index
 
 
 Translations
  
Borislav Petkov Feb. 5, 2024, 7:41 p.m. UTC | #6
On Wed, Jan 24, 2024 at 01:40:30PM +0100, Borislav Petkov wrote:
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Wed, 24 Jan 2024 13:37:52 +0100
> Subject: [PATCH] Documentation: Move RAS section to admin-guide
> 
> This is where this stuff should be.
> 
> Requested-by: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  Documentation/RAS/index.rst                        | 14 --------------
>  .../{ => admin-guide}/RAS/address-translation.rst  |  0
>  .../{ => admin-guide}/RAS/error-decoding.rst       |  0
>  Documentation/admin-guide/RAS/index.rst            |  7 +++++++
>  .../admin-guide/{ras.rst => RAS/main.rst}          | 10 +++++++---
>  Documentation/admin-guide/index.rst                |  2 +-
>  Documentation/index.rst                            |  1 -
>  7 files changed, 15 insertions(+), 19 deletions(-)
>  delete mode 100644 Documentation/RAS/index.rst
>  rename Documentation/{ => admin-guide}/RAS/address-translation.rst (100%)
>  rename Documentation/{ => admin-guide}/RAS/error-decoding.rst (100%)
>  create mode 100644 Documentation/admin-guide/RAS/index.rst
>  rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)

Now queued.

Thx.
  

Patch

diff --git a/Documentation/RAS/ras.rst b/Documentation/RAS/ras.rst
new file mode 100644
index 000000000000..2556b397cd27
--- /dev/null
+++ b/Documentation/RAS/ras.rst
@@ -0,0 +1,26 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+Reliability, Availability and Serviceability features
+=====================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
+
+Error decoding
+---------------
+
+* x86
+
+Error decoding on AMD systems should be done using the rasdaemon tool:
+https://github.com/mchehab/rasdaemon/
+
+While the daemon is running, it automatically logs and decodes
+errors. If it is not running, one can still decode such errors by
+supplying the hardware information from the error::
+
+        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
+
+Also, the user can pass a particular family and model to decode the error
+string::
+
+        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 9dfdc826618c..36e61783437c 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -113,6 +113,7 @@  to ReStructured Text format, or are simply too old.
    :maxdepth: 1
 
    staging/index
+   RAS/ras
 
 
 Translations
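
Note on the decode commands in the ras.rst hunk above: the two invocations differ only in the optional --family, --model and --bank arguments. A minimal sketch of assembling them follows, assuming nothing beyond the flags quoted in the patch. The helper name build_decode_command and the register values are hypothetical placeholders, and the value format rasdaemon accepts is not stated in the thread, so the sketch only builds and prints the command line rather than running it.

```python
import shlex


def build_decode_command(status, ipid, family=None, model=None, bank=None):
    """Assemble the rasdaemon SMCA decode invocation quoted in the patch.

    Hypothetical helper: only the flags that appear in the patch text
    (-p, --status, --ipid, --smca, --family, --model, --bank) are used.
    """
    argv = ["rasdaemon", "-p", "--status", status, "--ipid", ipid, "--smca"]
    optional = (("--family", family), ("--model", model), ("--bank", bank))
    for flag, value in optional:
        # Only append the optional arguments the caller actually supplied.
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Placeholder values, not taken from a real error record.
basic = build_decode_command("0x1", "0x2")
full = build_decode_command("0x1", "0x2", family=25, model=1, bank=3)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(basic))
print(shlex.join(full))
```

Keeping the arguments as a list until the final print avoids quoting mistakes if the command is later handed to something like subprocess.run instead of being pasted into a shell.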