[v2,07/17] kexec: Add documentation for KHO

Message ID 20231222195144.24532-2-graf@amazon.com
State New

Commit Message

Alexander Graf Dec. 22, 2023, 7:51 p.m. UTC
  With KHO in place, let's add documentation that describes what it is and
how to use it.

Signed-off-by: Alexander Graf <graf@amazon.com>
---
 Documentation/kho/concepts.rst   | 88 ++++++++++++++++++++++++++++++++
 Documentation/kho/index.rst      | 19 +++++++
 Documentation/kho/usage.rst      | 57 +++++++++++++++++++++
 Documentation/subsystem-apis.rst |  1 +
 4 files changed, 165 insertions(+)
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst
  

Comments

Rob Herring Jan. 3, 2024, 6:48 p.m. UTC | #1
On Fri, Dec 22, 2023 at 12:52 PM Alexander Graf <graf@amazon.com> wrote:
>
> With KHO in place, let's add documentation that describes what it is and
> how to use it.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> ---
>  Documentation/kho/concepts.rst   | 88 ++++++++++++++++++++++++++++++++
>  Documentation/kho/index.rst      | 19 +++++++
>  Documentation/kho/usage.rst      | 57 +++++++++++++++++++++
>  Documentation/subsystem-apis.rst |  1 +
>  4 files changed, 165 insertions(+)
>  create mode 100644 Documentation/kho/concepts.rst
>  create mode 100644 Documentation/kho/index.rst
>  create mode 100644 Documentation/kho/usage.rst
>
> diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
> new file mode 100644
> index 000000000000..8e4fe8c57865
> --- /dev/null
> +++ b/Documentation/kho/concepts.rst
> @@ -0,0 +1,88 @@
> +.. SPDX-License-Identifier: GPL-2.0-or-later
> +
> +=======================
> +Kexec Handover Concepts
> +=======================
> +
> +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
> +arbitrary properties as well as memory locations - across kexec.
> +
> +It introduces multiple concepts:
> +
> +KHO Device Tree
> +---------------
> +
> +Every KHO kexec carries a KHO specific flattened device tree blob that
> +describes the state of the system. Device drivers can register to KHO to
> +serialize their state before kexec. After KHO, device drivers can read
> +the device tree and extract previous state.

How does this work with kexec when there is also the FDT for the h/w?
The h/w FDT has a /chosen property pointing to this FDT blob?

> +
> +KHO only uses the fdt container format and libfdt library, but does not
> +adhere to the same property semantics that normal device trees do: Properties
> +are passed in native endianness and standardized properties like ``regs`` and
> +``ranges`` do not exist, hence there are no ``#...-cells`` properties.

I think native endianness is asking for trouble. libfdt would need
different swap functions here than elsewhere in the kernel for example
which wouldn't even work. So you are just crossing your fingers that
you aren't using any libfdt functions that swap. And when I sync
dtc/libfdt and that changes, I might break you.

Also, if you want to dump the FDT and do a dtc DTB->DTS pass, it is
not going to be too readable given that it outputs swapped 32-bit values
for anything that's a 4 byte multiple.

> +
> +KHO introduces a new concept to its device tree: ``mem`` properties. A
> +``mem`` property can inside any subnode in the device tree. When present,
> +it contains an array of physical memory ranges that the new kernel must mark
> +as reserved on boot. It is recommended, but not required, to make these ranges
> +as physically contiguous as possible to reduce the number of array elements ::
> +
> +    struct kho_mem {
> +            __u64 addr;
> +            __u64 len;
> +    };
> +
> +After boot, drivers can call the kho subsystem to transfer ownership of memory
> +that was reserved via a ``mem`` property to themselves to continue using memory
> +from the previous execution.
> +
> +The KHO device tree follows the in-Linux schema requirements. Any element in
> +the device tree is documented via device tree schema yamls that explain what
> +data gets transferred.

If this is all separate, then I think the schemas should be too. And
then from my (DT maintainer) perspective, you can do whatever you want
here (like FIT images). The dtschema tools are pretty much only geared
for "normal" DTs. A couple of problems come to mind. You can't exclude
or change standard properties. The decoding of the DTB to run
validation assumes big endian. We could probably split things up a
bit, but you may be better off just using jsonschema directly. I'm not
even sure running validation here would be that valuable. You have 1
source of code generating the DT and 1 consumer. Yes, there's
different kernel versions to deal with, but it's not 100s of people
creating 1000s of DTs with 100s of nodes.

You might look at the netlink stuff which is using its own yaml syntax
to generate code and jsonschema is used to validate the yaml.

Rob
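
The byte-order concern raised above can be illustrated with a small user-space sketch. It assumes nothing about KHO's real code: `be32_decode()` is a hand-written stand-in for what libfdt's `fdt32_to_cpu()` does (interpret a cell as big-endian), while `native32_store()` and `native32_load()` are hypothetical helpers modelling a producer and consumer that exchange values in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode 4 bytes as a big-endian 32-bit value, the byte order that the
 * FDT format assumes for cells (stand-in for fdt32_to_cpu()). */
static uint32_t be32_decode(const unsigned char *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Store a value raw, in host byte order, as the KHO text describes. */
static void native32_store(unsigned char *p, uint32_t v)
{
	memcpy(p, &v, sizeof(v));
}

/* Read the value back without any swapping. */
static uint32_t native32_load(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}
```

On a little-endian host, storing 0x12345678 with `native32_store()` and reading it with `be32_decode()` yields 0x78563412, which is the kind of swapped value a DTB->DTS dump would print. On a big-endian host both reads agree, so the problem only shows up on some machines.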
  
Alexander Graf Jan. 17, 2024, 2:01 p.m. UTC | #2
On 03.01.24 19:48, Rob Herring wrote:
>
> On Fri, Dec 22, 2023 at 12:52 PM Alexander Graf <graf@amazon.com> wrote:
>> With KHO in place, let's add documentation that describes what it is and
>> how to use it.
>>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>> ---
>>   Documentation/kho/concepts.rst   | 88 ++++++++++++++++++++++++++++++++
>>   Documentation/kho/index.rst      | 19 +++++++
>>   Documentation/kho/usage.rst      | 57 +++++++++++++++++++++
>>   Documentation/subsystem-apis.rst |  1 +
>>   4 files changed, 165 insertions(+)
>>   create mode 100644 Documentation/kho/concepts.rst
>>   create mode 100644 Documentation/kho/index.rst
>>   create mode 100644 Documentation/kho/usage.rst
>>
>> diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
>> new file mode 100644
>> index 000000000000..8e4fe8c57865
>> --- /dev/null
>> +++ b/Documentation/kho/concepts.rst
>> @@ -0,0 +1,88 @@
>> +.. SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +=======================
>> +Kexec Handover Concepts
>> +=======================
>> +
>> +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
>> +arbitrary properties as well as memory locations - across kexec.
>> +
>> +It introduces multiple concepts:
>> +
>> +KHO Device Tree
>> +---------------
>> +
>> +Every KHO kexec carries a KHO specific flattened device tree blob that
>> +describes the state of the system. Device drivers can register to KHO to
>> +serialize their state before kexec. After KHO, device drivers can read
>> +the device tree and extract previous state.
> How does this work with kexec when there is also the FDT for the h/w?
> The h/w FDT has a /chosen property pointing to this FDT blob?


Yep, exactly.


>
>> +
>> +KHO only uses the fdt container format and libfdt library, but does not
>> +adhere to the same property semantics that normal device trees do: Properties
>> +are passed in native endianness and standardized properties like ``regs`` and
>> +``ranges`` do not exist, hence there are no ``#...-cells`` properties.
> I think native endianness is asking for trouble. libfdt would need
> different swap functions here than elsewhere in the kernel for example
> which wouldn't even work. So you are just crossing your fingers that
> you aren't using any libfdt functions that swap. And when I sync
> dtc/libfdt and that changes, I might break you.
>
> Also, if you want to dump the FDT and do a dtc DTB->DTS pass, it is
> not going to be too readable given that outputs swapped 32-bit values
> for anything that's a 4 byte multiple.


Yeah, but big endian these days is just a complete waste of brain and 
cpu cycles :). And yes, I don't really want to use any libfdt helper 
functions to read data. I use it only to give me the raw data and take 
it from there.


>
>> +
>> +KHO introduces a new concept to its device tree: ``mem`` properties. A
>> +``mem`` property can inside any subnode in the device tree. When present,
>> +it contains an array of physical memory ranges that the new kernel must mark
>> +as reserved on boot. It is recommended, but not required, to make these ranges
>> +as physically contiguous as possible to reduce the number of array elements ::
>> +
>> +    struct kho_mem {
>> +            __u64 addr;
>> +            __u64 len;
>> +    };
>> +
>> +After boot, drivers can call the kho subsystem to transfer ownership of memory
>> +that was reserved via a ``mem`` property to themselves to continue using memory
>> +from the previous execution.
>> +
>> +The KHO device tree follows the in-Linux schema requirements. Any element in
>> +the device tree is documented via device tree schema yamls that explain what
>> +data gets transferred.
> If this is all separate, then I think the schemas should be too. And
> then from my (DT maintainer) perspective, you can do whatever you want
> here (like FIT images). The dtschema tools are pretty much only geared
> for "normal" DTs. A couple of problems come to mind. You can't exclude
> or change standard properties. The decoding of the DTB to run
> validation assumes big endian. We could probably split things up a
> bit, but you may be better off just using jsonschema directly. I'm not
> even sure running validation here would that valuable. You have 1
> source of code generating the DT and 1 consumer. Yes, there's
> different kernel versions to deal with, but it's not 100s of people
> creating 1000s of DTs with 100s of nodes.
>
> You might look at the netlink stuff which is using its own yaml syntax
> to generate code and jsonschema is used to validate the yaml.


I'm currently a lot more interested in the documentation aspect than in 
the validation, yeah. So I think for v3, I'll just throw the schemas 
into the Documentation/kho directory without any validation. We can 
worry about that later :)

Thanks a lot again for the review!


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
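
A minimal sketch of the "take the raw data and go from there" approach, under stated assumptions: the raw buffer and its length would come from a call such as libfdt's `fdt_getprop()`, `struct kho_mem` is re-declared here with `uint64_t` so that it builds in user space, and `kho_mem_count()` and `kho_mem_get()` are hypothetical helper names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for the struct in the patch (which uses __u64). */
struct kho_mem {
	uint64_t addr;
	uint64_t len;
};

/* Number of entries in a raw "mem" payload of prop_len bytes, or -1 if
 * the length is not a whole multiple of the entry size. */
static long kho_mem_count(size_t prop_len)
{
	if (prop_len % sizeof(struct kho_mem) != 0)
		return -1;
	return (long)(prop_len / sizeof(struct kho_mem));
}

/* Copy entry idx out of the raw payload in native byte order. memcpy is
 * used because FDT property data is only guaranteed 4-byte alignment,
 * so a direct 64-bit load could be misaligned. */
static struct kho_mem kho_mem_get(const void *prop, size_t idx)
{
	struct kho_mem m;

	assert(prop != NULL);
	memcpy(&m, (const unsigned char *)prop + idx * sizeof(m), sizeof(m));
	return m;
}
```

A caller would first check `kho_mem_count(len)` for a non-negative result and then loop over `kho_mem_get(prop, i)`. No byte swapping happens anywhere, which is exactly why this only works when producer and consumer share the same endianness.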
  
Rob Herring Jan. 17, 2024, 4:54 p.m. UTC | #3
On Wed, Jan 17, 2024 at 8:02 AM Alexander Graf <graf@amazon.com> wrote:
>
>
> On 03.01.24 19:48, Rob Herring wrote:
> >
> > On Fri, Dec 22, 2023 at 12:52 PM Alexander Graf <graf@amazon.com> wrote:
> >> With KHO in place, let's add documentation that describes what it is and
> >> how to use it.
> >>
> >> Signed-off-by: Alexander Graf <graf@amazon.com>
> >> ---
> >>   Documentation/kho/concepts.rst   | 88 ++++++++++++++++++++++++++++++++
> >>   Documentation/kho/index.rst      | 19 +++++++
> >>   Documentation/kho/usage.rst      | 57 +++++++++++++++++++++
> >>   Documentation/subsystem-apis.rst |  1 +
> >>   4 files changed, 165 insertions(+)
> >>   create mode 100644 Documentation/kho/concepts.rst
> >>   create mode 100644 Documentation/kho/index.rst
> >>   create mode 100644 Documentation/kho/usage.rst
> >>
> >> diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
> >> new file mode 100644
> >> index 000000000000..8e4fe8c57865
> >> --- /dev/null
> >> +++ b/Documentation/kho/concepts.rst
> >> @@ -0,0 +1,88 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-or-later
> >> +
> >> +=======================
> >> +Kexec Handover Concepts
> >> +=======================
> >> +
> >> +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
> >> +arbitrary properties as well as memory locations - across kexec.
> >> +
> >> +It introduces multiple concepts:
> >> +
> >> +KHO Device Tree
> >> +---------------
> >> +
> >> +Every KHO kexec carries a KHO specific flattened device tree blob that
> >> +describes the state of the system. Device drivers can register to KHO to
> >> +serialize their state before kexec. After KHO, device drivers can read
> >> +the device tree and extract previous state.

Can you avoid calling anything "device tree" as much as possible? We
can't avoid that the format is FDT/DTB, but otherwise none of this is
Devicetree as most folks know it. Sure, there can be trees of devices
which are not Devicetree, but this is neither. You could have used
BSON or any hierarchical key-value pair serialization format just as
easily (if we already had a parser in the kernel).

> > How does this work with kexec when there is also the FDT for the h/w?
> > The h/w FDT has a /chosen property pointing to this FDT blob?
>
>
> Yep, exactly.

Those properties need to be documented here[1].

[...]

> >> +KHO introduces a new concept to its device tree: ``mem`` properties. A
> >> +``mem`` property can inside any subnode in the device tree. When present,
> >> +it contains an array of physical memory ranges that the new kernel must mark
> >> +as reserved on boot. It is recommended, but not required, to make these ranges
> >> +as physically contiguous as possible to reduce the number of array elements ::
> >> +
> >> +    struct kho_mem {
> >> +            __u64 addr;
> >> +            __u64 len;
> >> +    };
> >> +
> >> +After boot, drivers can call the kho subsystem to transfer ownership of memory
> >> +that was reserved via a ``mem`` property to themselves to continue using memory
> >> +from the previous execution.
> >> +
> >> +The KHO device tree follows the in-Linux schema requirements. Any element in
> >> +the device tree is documented via device tree schema yamls that explain what
> >> +data gets transferred.
> > If this is all separate, then I think the schemas should be too. And
> > then from my (DT maintainer) perspective, you can do whatever you want
> > here (like FIT images). The dtschema tools are pretty much only geared
> > for "normal" DTs. A couple of problems come to mind. You can't exclude
> > or change standard properties. The decoding of the DTB to run
> > validation assumes big endian. We could probably split things up a
> > bit, but you may be better off just using jsonschema directly. I'm not
> > even sure running validation here would that valuable. You have 1
> > source of code generating the DT and 1 consumer. Yes, there's
> > different kernel versions to deal with, but it's not 100s of people
> > creating 1000s of DTs with 100s of nodes.
> >
> > You might look at the netlink stuff which is using its own yaml syntax
> > to generate code and jsonschema is used to validate the yaml.
>
>
> I'm currently a lot more interested in the documentation aspect than in
> the validation, yeah. So I think for v3, I'll just throw the schemas
> into the Documentation/kho directory without any validation. We can
> worry about that later :)

I'll regret that when I get patches fixing them, but okay.

Rob

[1] https://github.com/devicetree-org/dt-schema/blob/main/dtschema/schemas/chosen.yaml
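
The recommendation in the quoted ``mem`` paragraph, to keep ranges as physically contiguous as possible so the array stays short, can be pictured as a merge step over an array of `struct kho_mem`. The following is only a sketch: `kho_mem_coalesce()` is a hypothetical name, the struct uses `uint64_t` in place of `__u64`, and the code assumes `addr + len` never overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space stand-in for the struct in the patch (which uses __u64). */
struct kho_mem {
	uint64_t addr;
	uint64_t len;
};

/* qsort comparator: order ranges by start address. */
static int kho_mem_cmp(const void *a, const void *b)
{
	const struct kho_mem *x = a, *y = b;

	if (x->addr < y->addr)
		return -1;
	if (x->addr > y->addr)
		return 1;
	return 0;
}

/* Sort ranges by address and merge those that touch or overlap.
 * Works in place and returns the new number of entries. */
static size_t kho_mem_coalesce(struct kho_mem *r, size_t n)
{
	size_t out = 0, i;

	assert(r != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(r, n, sizeof(*r), kho_mem_cmp);
	for (i = 1; i < n; i++) {
		uint64_t end = r[out].addr + r[out].len;

		if (r[i].addr <= end) {
			/* Adjacent or overlapping: extend the current range. */
			uint64_t new_end = r[i].addr + r[i].len;

			if (new_end > end)
				r[out].len = new_end - r[out].addr;
		} else {
			/* Gap: start a new output range. */
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

For example, three page-sized ranges at 0x1000, 0x2000 and 0x3000 collapse into a single entry covering 0x1000 to 0x4000, so the consumer has one element to reserve instead of three.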
  
Alexander Graf Jan. 17, 2024, 5 p.m. UTC | #4
On 17.01.24 17:54, Rob Herring wrote:
> On Wed, Jan 17, 2024 at 8:02 AM Alexander Graf <graf@amazon.com> wrote:
>>
>> On 03.01.24 19:48, Rob Herring wrote:
>>> On Fri, Dec 22, 2023 at 12:52 PM Alexander Graf <graf@amazon.com> wrote:
>>>> With KHO in place, let's add documentation that describes what it is and
>>>> how to use it.
>>>>
>>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>>> ---
>>>>    Documentation/kho/concepts.rst   | 88 ++++++++++++++++++++++++++++++++
>>>>    Documentation/kho/index.rst      | 19 +++++++
>>>>    Documentation/kho/usage.rst      | 57 +++++++++++++++++++++
>>>>    Documentation/subsystem-apis.rst |  1 +
>>>>    4 files changed, 165 insertions(+)
>>>>    create mode 100644 Documentation/kho/concepts.rst
>>>>    create mode 100644 Documentation/kho/index.rst
>>>>    create mode 100644 Documentation/kho/usage.rst
>>>>
>>>> diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
>>>> new file mode 100644
>>>> index 000000000000..8e4fe8c57865
>>>> --- /dev/null
>>>> +++ b/Documentation/kho/concepts.rst
>>>> @@ -0,0 +1,88 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0-or-later
>>>> +
>>>> +=======================
>>>> +Kexec Handover Concepts
>>>> +=======================
>>>> +
>>>> +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
>>>> +arbitrary properties as well as memory locations - across kexec.
>>>> +
>>>> +It introduces multiple concepts:
>>>> +
>>>> +KHO Device Tree
>>>> +---------------
>>>> +
>>>> +Every KHO kexec carries a KHO specific flattened device tree blob that
>>>> +describes the state of the system. Device drivers can register to KHO to
>>>> +serialize their state before kexec. After KHO, device drivers can read
>>>> +the device tree and extract previous state.
> Can you avoid calling anything "device tree" as much as possible. We
> can't avoid the format is FDT/DTB, but otherwise none of this is
> Devicetree as most folks know it. Sure, there can be trees of devices
> which are not Devicetree, but this is neither. You could have used
> BSON or any hierarchical key-value pair serialization format just as
> easily (if we already had a parser in the kernel).


I understand and agree - it's been confusing to pretty much everyone who 
was looking at KHO so far. Unfortunately I'm terrible at naming. Do you 
happen to have a good suggestion? :)


>
>>> How does this work with kexec when there is also the FDT for the h/w?
>>> The h/w FDT has a /chosen property pointing to this FDT blob?
>>
>> Yep, exactly.
> Those properties need to be documented here[1].


Oooh, thanks a lot for the pointer! I'll add them :)


Alex





  

Patch

diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
new file mode 100644
index 000000000000..8e4fe8c57865
--- /dev/null
+++ b/Documentation/kho/concepts.rst
@@ -0,0 +1,88 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+=======================
+Kexec Handover Concepts
+=======================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+It introduces multiple concepts:
+
+KHO Device Tree
+---------------
+
+Every KHO kexec carries a KHO specific flattened device tree blob that
+describes the state of the system. Device drivers can register to KHO to
+serialize their state before kexec. After KHO, device drivers can read
+the device tree and extract previous state.
+
+KHO only uses the fdt container format and libfdt library, but does not
+adhere to the same property semantics that normal device trees do: Properties
+are passed in native endianness and standardized properties like ``regs`` and
+``ranges`` do not exist, hence there are no ``#...-cells`` properties.
+
+KHO introduces a new concept to its device tree: ``mem`` properties. A
+``mem`` property can be inside any subnode in the device tree. When present,
+it contains an array of physical memory ranges that the new kernel must mark
+as reserved on boot. It is recommended, but not required, to make these ranges
+as physically contiguous as possible to reduce the number of array elements ::
+
+    struct kho_mem {
+            __u64 addr;
+            __u64 len;
+    };
+
+After boot, drivers can call the kho subsystem to transfer ownership of memory
+that was reserved via a ``mem`` property to themselves to continue using memory
+from the previous execution.
+
+The KHO device tree follows the in-Linux schema requirements. Any element in
+the device tree is documented via device tree schema yamls that explain what
+data gets transferred.
+
+Mem cache
+---------
+
+The new kernel needs to know about all memory reservations, but is unable to
+parse the device tree yet in early bootup code because of memory limitations.
+To simplify the initial memory reservation flow, the old kernel passes a
+preprocessed array of physically contiguous reserved ranges to the new kernel.
+
+These reservations have to be separate from architectural memory maps and
+reservations because they differ on every kexec, while the architectural ones
+get passed directly between invocations.
+
+The fewer entries this cache contains, the faster the new kernel will boot.
+
+Scratch Region
+--------------
+
+To boot into kexec, we need to have a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations before it ingests the mem cache.
+
+We guarantee that we always have such a region through the scratch region: On
+first boot, you can pass the ``kho_scratch`` kernel command line option. When
+it is set, Linux allocates a CMA region of the given size. CMA gives us the
+guarantee that no handover pages land in that region, because handover
+pages must be at a static physical memory location and CMA enforces that
+only movable pages can be located inside.
+
+After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same region that was originally allocated. This allows
+us to recursively execute any number of KHO kexecs. Because we used this region
+for boot memory allocations and as target memory for kexec blobs, some parts
+of that memory region may be reserved. These reservations are irrelevant for
+the next KHO, because kexec can overwrite even the original kernel.
+
+KHO active phase
+----------------
+
+To enable a user space based kexec file loader, the kernel needs to be able to
+provide the device tree that describes the previous kernel's state before
+performing the actual kexec. The process of generating that device tree is
+called serialization. When the device tree is generated, some properties
+of the system may become immutable because they are already written down
+in the device tree. That state is called the KHO active phase.
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
new file mode 100644
index 000000000000..5e7eeeca8520
--- /dev/null
+++ b/Documentation/kho/index.rst
@@ -0,0 +1,19 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+========================
+Kexec Handover Subsystem
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   concepts
+   usage
+
+.. only::  subproject and html
+
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
new file mode 100644
index 000000000000..5efa2a58f9c3
--- /dev/null
+++ b/Documentation/kho/usage.rst
@@ -0,0 +1,57 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+This document expects that you are familiar with the base KHO
+:ref:`Documentation/kho/concepts.rst <concepts>`. If you have not read
+them yet, please do so now.
+
+Prerequisites
+-------------
+
+KHO is available when the ``CONFIG_KEXEC_KHO`` config option is set to y
+at compile time. Every KHO producer has its own config option that you
+need to enable if you would like to preserve their respective state across
+kexec.
+
+To use KHO, please boot the kernel with the ``kho_scratch`` command
+line parameter set to allocate a scratch region. For example
+``kho_scratch=512M`` will reserve a 512 MiB scratch region on boot.
+
+Perform a KHO kexec
+-------------------
+
+Before you can perform a KHO kexec, you need to move the system into the
+:ref:`Documentation/kho/concepts.rst <KHO active phase>` ::
+
+  $ echo 1 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is available in ``/sys/kernel/kho/dt``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+  # kexec -l Image --initrd=initrd -s
+  # kexec -e
+
+The new kernel will boot up and contain some of the previous kernel's state.
+
+For example, if you enabled ``CONFIG_FTRACE_KHO``, the new kernel will contain
+the old kernel's trace buffers in ``/sys/kernel/debug/tracing/trace``.
+
+Abort a KHO kexec
+-----------------
+
+You can move the system out of KHO active phase again by calling ::
+
+  $ echo 0 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is no longer available in
+``/sys/kernel/kho/dt``.
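
The usage text gives ``kho_scratch=512M`` as reserving a 512 MiB region. How such a value maps to bytes can be sketched as below. This is an illustration, not the patch's parser: `parse_size()` is a hypothetical user-space helper, and the patch shown here does not say how ``kho_scratch`` is actually parsed (the kernel has `memparse()` for size suffixes of this kind). Overflow from the suffix shift is not checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G binary suffix into bytes.
 * Returns 0 on success, -1 on malformed input. */
static int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	errno = 0;
	v = strtoull(s, &end, 0);
	if (end == s || errno == ERANGE)
		return -1;

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;	/* trailing characters after the suffix */

	*out = (uint64_t)v;
	return 0;
}
```

With this helper, `"512M"` parses to 536870912 bytes, which is 512 * 1024 * 1024.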
diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst
index 930dc23998a0..8207b6514d87 100644
--- a/Documentation/subsystem-apis.rst
+++ b/Documentation/subsystem-apis.rst
@@ -86,3 +86,4 @@  Storage interfaces
    misc-devices/index
    peci/index
    wmi/index
+   kho/index
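
The ``echo`` commands in the usage document amount to writing a flag value to ``/sys/kernel/kho/active``. A program that wants to do the same without a shell could look roughly like the sketch below. `kho_write_flag()` is a hypothetical helper, and treating "0" as the value that leaves the active phase is an assumption; only writing "1" to enter it is clearly established by the text.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" or "0" followed by a newline to a sysfs-style control file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int kho_write_flag(const char *path, int on)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(on ? "1\n" : "0\n", f) >= 0;
	/* sysfs may report a rejected write only when the buffer is
	 * flushed, so the fclose() result matters too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would run `kho_write_flag("/sys/kernel/kho/active", 1)` as root before loading the kexec payload, and check the return value, since the write can fail when KHO is not configured.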